HTML Version

Background

E-commerce has boomed in recent years, and many brick-and-mortar retail corporations have entered the e-commerce competition (e.g. Walmart, Target). The goal of this project is to use the dataset provided by Walmart to forecast weekly sales in 45 stores based on a set of features (e.g. store size, unemployment rate, Consumer Price Index (CPI)). The first part of the analysis reflects the past and current financial condition of the corporation, for either a single store or a particular department. This part also explores how each feature impacts the business outcome, which is crucial for operations. In the second part of the analysis, a predictive time-series model will be implemented to forecast future sales on a weekly basis. This step is indispensable for a company that wants an estimate of the future direction of its profit, or even its stock price, helping the management team prepare a business strategy for the next period.

Client / Audience

The primary audience for this analysis is the Walmart management team and executives. The findings from this research give them hints for designing their operating format and business plan. The second group of readers are professionals concerned with Walmart's business condition; the findings offer them insight into current and future sales in the brick-and-mortar stores. The third group are people who want to discover the sales patterns in a Walmart store or a particular department. The analysis also shows how seasonal factors change sales, which is valuable for professionals and merchants.

Questions to be addressed

- Discover the sales distribution across stores and individual departments

- Discover the correlations between the features and weekly sales

- Discover how each feature contributes differently to the predictive models

- Determine the best model(s) for forecasting weekly sales (continuous values)

- How do the models perform in different states?

Dataset Exploration

Summary

  • The dataset contains 421,570 instances, covering weekly sales results for 45 retail stores. Each store has up to 99 departments.
In [1]:
import pandas as pd
import numpy as np

trainDf = pd.read_csv("./data/train.csv")
feaDf = pd.read_csv("./data/features.csv").drop(["IsHoliday"], axis=1)
storesDf = pd.read_csv("./data/stores.csv")


df1 = pd.merge(trainDf, feaDf, on = ["Store", "Date"])
#print(df1.shape)
train = pd.merge(df1, storesDf, on = "Store")
print("Dataset size:",train.shape)

#Rename column names
train.columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
       'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
       'MarkDown5', 'CPI', 'Unemployment', 'Store_Type', 'Size']

display(train.head(3))
Dataset size: (421570, 16)
Store Dept Date Weekly_Sales IsHoliday Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment Store_Type Size
0 1 1 2010-02-05 24924.50 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
1 1 2 2010-02-05 50605.27 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
2 1 3 2010-02-05 13740.12 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
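A note on the merges above: `pd.merge` silently drops rows whose keys fail to match. A minimal sketch of a sanity check using merge's `indicator` flag, shown on toy frames (the names `trainDf`/`feaDf` only mirror the cell above):

```python
import pandas as pd

# Toy stand-ins for train.csv and features.csv, keyed by (Store, Date)
trainDf = pd.DataFrame({"Store": [1, 1, 2], "Date": ["d1", "d2", "d1"],
                        "Weekly_Sales": [10.0, 12.0, 9.0]})
feaDf = pd.DataFrame({"Store": [1, 1, 2], "Date": ["d1", "d2", "d1"],
                      "CPI": [211.0, 211.1, 210.9]})

# indicator=True adds a _merge column recording where each row came from;
# every row reading "both" means the join lost nothing
merged = pd.merge(trainDf, feaDf, on=["Store", "Date"], how="outer", indicator=True)
print((merged["_merge"] == "both").all())   # True when the keys align
```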

Missing Values

  • Before getting into the analysis, we first deal with the missing values. As shown below, the missing values are concentrated in five features: MarkDown1, MarkDown2, MarkDown3, MarkDown4 and MarkDown5; no missing values are found in the other features. For this reason, we dig deeper into those features to check whether the missing values follow a pattern that tells us how to handle them.

  • The table below shows the proportion of missing values for every store. Based on Table 1, the proportion of missing values among these five features ranges from 63% to 90%. Given such a high volume, we dig further to check whether the values are missing at random. Table 2 shows that the data without missing values range from 2011-11-11 to 2012-10-26, and Table 3 shows that the data with missing values range from 2010-02-05 to 2011-11-04. We can therefore conclude that this large volume of missing values is NOT missing at random. Considering the volume and the missing pattern, I propose dropping the five features (i.e. columns) instead of the records (i.e. rows), because keeping as many records as possible is important for the time-series analysis.
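The per-store loops in the next cells can be cross-checked with a one-line summary. A minimal sketch on a toy frame (on the real `train` frame, `train.isnull().mean()` puts the MarkDown columns in the 63%-90% band and every other column at 0%):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged `train` DataFrame
toy = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "MarkDown1": [np.nan, 5.0, np.nan, np.nan],
    "CPI": [211.1, 211.2, 210.9, 211.0],
})

# Share of missing values per column
missing_share = toy.isnull().mean()
print(missing_share["MarkDown1"])   # 0.75 on this toy frame

# Per-store breakdown, matching Table 1's nested loop in one groupby
per_store = toy.groupby("Store")["MarkDown1"].apply(lambda s: s.isnull().mean())
print(per_store[1], per_store[2])   # 0.5 1.0
```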

In [5]:
cntLst = []
print("Table 1")
print('''Proportion of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''')
print()
cnt=0
for store in train.Store.unique():
    for var in ["MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"]:
        cntLst.append(train[var][train.Store == store].isnull().sum())
    if cnt<=5:
        print("Store #%d"%(store), end="")
        print(" with %d instance has missing values"%(train[train.Store == store].shape[0]), end=" ")
        print(cntLst, end=" ")
        print("in percent", end=" ")
        print(np.round(np.array(cntLst) / train[train.Store == store].shape[0], 2)*100, end="\n")
        cnt+=1
    elif cnt<=6:
        print("...")
        cnt+=1
    else:
        pass
    cntLst=[]
Table 1
Proportion of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"

Store #1 with 10244 instance has missing values [6587, 7229, 6656, 6587, 6587] in percent [64. 71. 65. 64. 64.]
Store #2 with 10238 instance has missing values [6575, 7218, 6646, 6575, 6575] in percent [64. 71. 65. 64. 64.]
Store #3 with 9036 instance has missing values [5791, 6554, 6230, 5922, 5791] in percent [64. 73. 69. 66. 64.]
Store #4 with 10272 instance has missing values [6596, 7171, 6739, 6668, 6596] in percent [64. 70. 66. 65. 64.]
Store #5 with 8999 instance has missing values [5776, 6660, 6279, 5906, 5776] in percent [64. 74. 70. 66. 64.]
Store #6 with 10211 instance has missing values [6559, 7057, 6702, 6559, 6559] in percent [64. 69. 66. 64. 64.]
...
In [7]:
cntLst = []
print("Table 2")
print('''Date range WITHOUT Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''', end="\n")
print()
cnt=0
for store in train.Store.unique():
    arr = sorted(pd.to_datetime(train.Date[(train.MarkDown1.notnull()) & (train.Store == store)].unique()))
    if cnt<=5:
        print("Store %d %s - %s"%(store,min(arr),max(arr)))
        cnt+=1
    elif cnt<=6:
        print("...")
        cnt+=1
    else:
        pass
Table 2
Date range WITHOUT Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"

Store 1 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 2 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 3 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 4 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 5 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 6 2011-11-11 00:00:00 - 2012-10-26 00:00:00
...
In [8]:
cntLst = []
print("Table 3")
print('''Date range of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''', end="\n")
print()
cnt=0
for store in train.Store.unique():
    arr = sorted(pd.to_datetime(train.Date[(train.MarkDown1.isnull()) & (train.Store == store)].unique()))
    if cnt<=5:
        print("Store %d %s - %s"%(store,min(arr),max(arr)))
        cnt+=1
    elif cnt<=6:
        print("...")
        cnt+=1
    else:
        pass
Table 3
Date range of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"

Store 1 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 2 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 3 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 4 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 5 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 6 2010-02-05 00:00:00 - 2011-11-04 00:00:00
...

Selected Features

  • IsHoliday (binary)
  • Temperature (continuous)
  • Fuel_Price (continuous)
  • Consumer Price Index, CPI (continuous)
  • Store_Type (categorical: A, B, C)
  • Unemployment Rate (continuous)

Categorical Features

  • This dataset contains two categorical features, "IsHoliday" and "Store_Type". To proceed with the time-series analysis, these categorical attributes are converted into dummy (indicator) variables, as shown in Table 4. The binary "IsHoliday" feature expands into 2 columns (i.e. 1, 0) and "Store_Type" into 3 columns (i.e. A, B, C).
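The next cell uses `pd.get_dummies` directly; for linear models it is often worth passing `drop_first=True` so one redundant indicator per feature is dropped, avoiding the dummy-variable trap. A small sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({"Store_Type": ["A", "B", "C", "A"],
                    "IsHoliday": [False, True, False, False]})

# drop_first removes one level per feature: "A" and False become baselines,
# carried implicitly by the remaining indicator columns
dummies = pd.get_dummies(toy, columns=["Store_Type", "IsHoliday"], drop_first=True)
print(list(dummies.columns))   # ['Store_Type_B', 'Store_Type_C', 'IsHoliday_True']
```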
In [5]:
print("Table 4")
print('''Split the feature "Store_Type" into A, B, C and "IsHoliday" into 1, 0''')
#display(pd.get_dummies(train[["IsHoliday_x"]]))
pd.concat((pd.get_dummies(train["Store_Type"]),pd.get_dummies(train["IsHoliday"])), axis=1).head(5)
Table 4
Split the feature "Store_Type" into A, B, C and "IsHoliday" into 1, 0
Out[5]:
A B C False True
0 1 0 0 1 0
1 1 0 0 1 0
2 1 0 0 1 0
3 1 0 0 1 0
4 1 0 0 1 0

Outliers

  • Weekly_Sales: Due to geographical and holiday factors, the sales associated with each store are expected to vary. Referring to Figure 1, the weekly sales not only vary quite a bit, but a considerably large number of data points also fall outside the typical range of sales. As mentioned, this variation might be caused by mixed factors, so I will leave those "outliers" as they are for now.
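"Outside the typical range" here follows the box-plot convention: points beyond 1.5×IQR past the quartiles. A minimal sketch of counting such points on toy numbers (the real analysis would pass `train.Weekly_Sales` for one store):

```python
import pandas as pd

# Toy weekly sales (in thousands) for a single store
sales = pd.Series([10, 12, 11, 13, 12, 95])

q1, q3 = sales.quantile([0.25, 0.75])
iqr = q3 - q1
# Flag points outside [q1 - 1.5*IQR, q3 + 1.5*IQR], as a box plot does
outliers = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]
print(list(outliers))   # [95]
```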
In [9]:
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.offline as offline

init_notebook_mode(connected=True)

namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.Weekly_Sales[train.Store == int(name)]/1000,2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of Sales Values by each store (Figure 1)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Weekly Sales (in thousands)",
                 titlefont=dict(size=18),
                 zeroline=True, range=[-10,200], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig, show_link=True)

Outliers

  • Temperature: Referring to Figure 2, the "Temperature" variable looks very consistent across the stores, and no outliers fall significantly outside the typical range for any store.
In [7]:
namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.Temperature[train.Store == int(name)],2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of Temperature Values by each store (Figure 2)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Temperature(F)",
                 titlefont=dict(size=18),
                 zeroline=True, range=[-5,120], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig)

Outliers

  • Fuel Price: Referring to Figure 3, the "Fuel_Price" feature is even more consistent than "Temperature", ranging between 2.5 and 4.5 across the stores, so I believe there are no outliers in this feature. This makes sense because fuel has a market price across the country, so little variation is expected within each geographical area.
In [8]:
namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.Fuel_Price[train.Store == int(name)],2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of Fuel Price Values by each store (Figure 3)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,
                ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Fuel Price($)",
                 titlefont=dict(size=18),
                 zeroline=True, 
                 range=[2,5], 
                 showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig)

Outliers

  • Consumer Price Index (CPI): According to the U.S. Bureau of Labor Statistics, the CPI is "a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services". Referring to Figure 4, the CPI values, ranging from 120 to 230, differ quite a bit among the stores, but no outliers appear in this dataset. This figure gives a relative indication of consumer spending power across the stores. As a working hypothesis, an area with a higher CPI should have more spending power and therefore generate more sales for the store, and vice versa. Further analysis will be conducted to measure this aspect as the analysis proceeds.
In [9]:
namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.CPI[train.Store == int(name)],2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of CPI Values by each store (Figure 4)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="CPI",
                 titlefont=dict(size=18),
                 zeroline=True, range=[100,250], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig)

Outliers

  • Unemployment: The unemployment rate is another index of consumer spending power. Referring to Figure 5, most of the values fall between 6 and 8, without outliers among the stores. As with the "CPI" feature, further analysis will be conducted to measure how this feature influences weekly sales across the stores.
In [15]:
namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.Unemployment[train.Store == int(name)],2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of Unemployment Rate by each store (Figure 5)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Unemployment Rate",
                 titlefont=dict(size=18),
                 zeroline=True, range=[0,20], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig)

Data Analysis


Weekly Sales Summary

Let's first take a close look at the trend of weekly sales. The plot captures the sales trend from Feb 2010 to Oct 2012, with sales ranging approximately between 0 and 4 million. The sales trend for each store is flat over time except during the holiday season, which runs approximately from Nov 20 to Dec 31 in each of the two full years covered. This period includes two big holidays, Thanksgiving and Christmas, when retailers traditionally offer huge discounts to customers. During these high-selling periods, sales climb to 1.5-2 times their usual level, so we ought to pay extra attention to predictions for the holiday period.

In [13]:
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.offline as offline

init_notebook_mode(connected=True)
In [11]:
#Reload Data

import pandas as pd
import numpy as np

trainDf = pd.read_csv("./data/train.csv")
feaDf = pd.read_csv("./data/features.csv").drop(["IsHoliday"], axis=1)
storesDf = pd.read_csv("./data/stores.csv")


df1 = pd.merge(trainDf, feaDf, on = ["Store", "Date"])
#print(df1.shape)
train = pd.merge(df1, storesDf, on = "Store")
print("Dataset size:",train.shape)

#Rename column names
train.columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
       'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
       'MarkDown5', 'CPI', 'Unemployment', 'Store_Type', 'Size']
Dataset size: (421570, 16)
In [12]:
# Dropping Features
train.drop(columns=["MarkDown1","MarkDown2","MarkDown3","MarkDown4", "MarkDown5"], inplace = True)
In [14]:
#Gather Weekly Sales for each store
namesLst = list(map(str, train.Store.unique()))
dateLst = train.Date.unique()

# Sum weekly sales (in millions) per (Store, Date) in a single groupby,
# replacing the quadratic nested loop over stores and dates
totals = (train.groupby(["Store", "Date"])["Weekly_Sales"].sum() / 1e6).round(2)

storeSalesDict = {}
for name in namesLst:
    storeSalesDict[int(name)] = [totals.get((int(name), d), 0.0) for d in dateLst]
In [15]:
traces=[]

for i, sales in storeSalesDict.items():
    #print(i, sales)
    traces.append(go.Scatter(
    x= pd.to_datetime(dateLst),
    y=sales,
    name="Store#"+str(i)
    ))
In [16]:
import numpy as np
# Visibility masks for the store buttons: row j is True only at trace j
n = len(storeSalesDict)
aa = [[i == j for i in range(n)] for j in range(n)]
In [17]:
updatemenus = list([
    dict(type="buttons",
         active=-1,
         buttons=list([
            dict(label = 'Store 1',
                 method = 'update',
                 args = [{'visible': aa[0]},
                         {'title': 'Store 1 Sales'}])]))])
In [20]:
layout = go.Layout(
    title = 'Weekly Sales (Figure 6)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=True,
    #width=700,
    #height=1000,
    yaxis =dict(title = "Weekly Sales (in Millions)", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Time Series (in Months)",
                 titlefont=dict(size=18),
                 exponentformat='e',
                 showexponent='all',
                 zeroline=True,  
                 showgrid=False,
                 rangeselector=dict(visible=True, 
                                    buttons=list([
                                        dict(count=1, 
                                             label="1m", 
                                             step="month", 
                                             stepmode ="backward"),
                                        dict(count=3, 
                                                 label="3m", 
                                                 step="month", 
                                                 stepmode ="backward"),
                                         dict(count=6, 
                                                 label="6m", 
                                                 step="month", 
                                                 stepmode ="backward"),
                                         dict(count=12, 
                                                 label="12m", 
                                                 step="month", 
                                                 stepmode ="backward"),
                                        dict(step ="all")])),
                 rangeslider = dict(visible=True)
                ),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False,
    updatemenus=updatemenus
)
In [21]:
fig = go.Figure(data=traces, layout=layout)
iplot(fig, show_link=True)

Weekly Sales in Store_Type

Figure 7 below shows the total sales generated by each store type (i.e. A, B, C). Unsurprisingly, sales on regular working days make up the major share of total sales, because holidays like Christmas and Thanksgiving account for only a small portion of the year. However, in Figure 8 we can observe that the average weekly sales during holidays are higher than during working days for Type A and B stores, while sales show essentially no difference for Type C stores. To determine the difference in sales performance between holidays and non-holidays more precisely, I propose a hypothesis test for further verification.
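The proposed hypothesis test could be a two-sample t-test on holiday vs. working-day weekly sales per store type. A hedged sketch on synthetic numbers (the real test would select `train.Weekly_Sales` by `Store_Type` and `IsHoliday`; the means and counts below are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for Type-A weekly sales observations
holiday = rng.normal(loc=25000, scale=3000, size=45)    # few holiday weeks
working = rng.normal(loc=20000, scale=3000, size=400)   # many working weeks

# Welch's t-test; H0: holiday and working-day mean sales are equal
t_stat, p_value = stats.ttest_ind(holiday, working, equal_var=False)
print(t_stat > 0, p_value < 0.05)
```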

In [23]:
trace1 = []

salesA = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==False].sum()
salesB = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==False].sum()
salesC = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==False].sum()

trace1 = go.Bar(
    x = sorted(list(set(train.Store_Type))),
    y = [salesA, salesB, salesC],
    name = "Working day"
)

salesA = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==True].sum()
salesB = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==True].sum()
salesC = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==True].sum()

trace2=[]
trace2 = go.Bar(
    x = sorted(list(set(train.Store_Type))),
    y = [salesA, salesB, salesC],
    name = "Holiday"
)

data = [trace1, trace2]
layout = go.Layout(
    title = "Store Type Sale (%s to %s) - Total (Figure 7) "% (min(train.Date), max(train.Date)),
    barmode='group',
    showlegend=True,
    xaxis = dict(title = "Store Type"),
    yaxis = dict(title = " Total Sales")
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=True)
In [25]:
#salesA_len= len(list(train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==False]))
salesA_len = len(list(set(train.Date[train.Store_Type=="A"][train.IsHoliday==False])))
salesA_tot = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==False].sum()
salesA_avg = salesA_tot/ salesA_len

salesB_len= len(list(set(train.Date[train.Store_Type=="B"][train.IsHoliday==False])))
salesB_tot = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==False].sum()
salesB_avg = salesB_tot/ salesB_len

salesC_len= len(list(set(train.Date[train.Store_Type=="C"][train.IsHoliday==False])))
salesC_tot = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==False].sum()
salesC_avg = salesC_tot/ salesC_len

trace1 = go.Bar(
    x = sorted(list(set(train.Store_Type))),
    y = [salesA_avg, salesB_avg, salesC_avg],
    name = "Working day - " + str(salesA_len) + " days"
)

salesA_len= len(list(set(train.Date[train.Store_Type=="A"][train.IsHoliday==True])))
salesA_tot = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==True].sum()
salesA_avg = salesA_tot/ salesA_len

salesB_len= len(list(set(train.Date[train.Store_Type=="B"][train.IsHoliday==True])))
salesB_tot = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==True].sum()
salesB_avg = salesB_tot/ salesB_len

salesC_len= len(list(set(train.Date[train.Store_Type=="C"][train.IsHoliday==True])))
salesC_tot = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==True].sum()
salesC_avg = salesC_tot/ salesC_len

trace2 = go.Bar(
    x = sorted(list(set(train.Store_Type))),
    y = [salesA_avg, salesB_avg, salesC_avg],
    name = "Holiday - "+str(salesA_len) + " days"
)

data = [trace1, trace2]
layout = go.Layout(
    title = "Store Type Sale (%s to %s) - Weekly (Figure 8) "% (min(train.Date), max(train.Date)),
    barmode='group',
    showlegend=True,
    xaxis = dict(title = "Store Type"),
    yaxis = dict(title = " Weekly Sales ")
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=True)

Weekly Sales in Department

Figure 9 below shows the ranking of departments by average weekly sales. The highest average weekly sales, about $75,000, are generated by Dept 92; the lowest, -$7.68, by Dept 47. This plot demonstrates the sales disparity between departments.

In [27]:
sales_dept = []
dept_lst = []

# Average weekly sales per department, sorted ascending
df = (train.groupby("Dept")["Weekly_Sales"].mean().round(2)
           .sort_values().reset_index())
df.columns = [0, 1]                # keep the positional labels used below
dept_lst = list(df.iloc[:, 0])
sales_dept = list(df.iloc[:, 1])
    
trace3 = []
trace3 = go.Bar(
    x = list(df.iloc[:,1]),
    y = list(map(str,list(df.iloc[:,0]))),
    orientation="h"
    #name = "Working day"
)

data = [trace3]
layout = go.Layout(
    #barmode='group',
    width=700,
    height=1200, 
    #showlegend=True
    title = 'Average Weekly Sales Ranking (Figure 9)',
    yaxis =dict(title = "Department Number", 
                #exponentformat='e',
                #showexponent='all',
                titlefont=dict(size=18),
                tick0=5,
                ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=False,
                type='category'),
    xaxis = dict(title="Weekly Sales($)",
                 titlefont=dict(size=18),
                 zeroline=True, 
                 #range=[2,5], 
                 showgrid=True)
)

annotations=[]
for i in range(len(dept_lst)):
    annotations.append(dict(x=sales_dept[i], 
                            y=dept_lst[i],
                            text="%0.2f"%(sales_dept[i]),
                            font=dict(family='Arial', 
                                      size=12,
                                      color='red'),
                            showarrow=True,
                            align = "center",
                            ax=40,
                            ay=0,
                            arrowhead=0))
    layout['annotations'] = annotations


fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=True)

Correlation with the Target Variable

The following scatter plots give a rough view of the correlations between the predictor variables and the target variable.

  • Weekly_Sales vs Temperature
  • Weekly_Sales vs Fuel_Price
  • Weekly_Sales vs CPI
  • Weekly_Sales vs Size

Unfortunately, it is difficult to determine the strength of the correlation between these features and weekly sales, except for store size (a larger store size appears to lead to higher weekly sales). To verify this further, hypothesis tests will be performed afterward.

In [ ]:
'''*************************No need to Run*****************'''
# Aggregate to one row per (store, date) with a single groupby instead of
# nested loops. Temperature, Fuel_Price, CPI and Size are constant within
# each (store, date) pair, so "first" recovers the single value.
# (The "Temperture" spelling is kept because the cells below read it back.)
(train.groupby(["Store", "Date"], as_index=False)
      .agg(Weekly_Sales=("Weekly_Sales", "sum"),
           Temperture=("Temperature", "first"),
           Fuel_Price=("Fuel_Price", "first"),
           CPI=("CPI", "first"),
           Size=("Size", "first"))
      .rename(columns={"Store": "store", "Date": "date"})
      .to_csv("Weekly_Sales_Corr2.csv"))
In [29]:
'''****************Read the Weekly Sales Correlated Feature Table************'''
df = pd.read_csv("Weekly_Sales_Corr2.csv", )
ws=list(df.Weekly_Sales)
temp=list(df.Temperture)
fp=list(df.Fuel_Price)
cpi=list(df.CPI)
size=list(df.Size)
#df.head(5,)
In [31]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from numpy import arange,array,ones
from scipy import stats
import seaborn as sns


plt.figure(figsize=(10,10))
sns.regplot(x="Fuel_Price", y="Weekly_Sales", data=df)
plt.title("Fuel Price vs Weekly_Sales", fontsize=20)
plt.show()
/Users/KevQuant/anaconda/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning:

Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.

In [34]:
plt.figure(figsize=(10,10))
sns.regplot(x="Temperture", y="Weekly_Sales", data=df)
plt.title("Temperature vs Weekly_Sales", fontsize=20)
plt.show()

In [32]:
plt.figure(figsize=(10,10))
sns.regplot(x="CPI", y="Weekly_Sales", data=df)
plt.title("CPI vs Weekly_Sales", fontsize=20)
plt.show()

In [33]:
plt.figure(figsize=(10,10))
sns.regplot(x="Size", y="Weekly_Sales", data=df)
plt.title("Store Size vs Weekly_Sales", fontsize=20)
plt.show()

Heatmap

Dependent vs Independent Variables

With the hint from the correlation plots above, store size (independent variable) correlates the most with the store Weekly_Sales (dependent variable). To verify this, the following heatmap shows the correlation values in numeric form. The values reflect the same result: store size is the only feature with a correlation above 0.5, while the remaining features show very little correlation with the dependent variable Weekly_Sales. In this case, I would suggest comparing the prediction results from models built with and without the least correlated features. Hypothesis tests will then be run to verify these results.

Independent vs Independent Variables

Among the independent variables themselves, we can observe that no pair is highly correlated, so I would not remove any of the features at this point. As above, hypothesis tests will be run to verify these results.

In [39]:
import matplotlib.pyplot as plt
import seaborn as sns

df2 = df.copy()
df2.drop(columns=["Unnamed: 0","store","date"], inplace=True)

df2_corr =df2.corr()
sns.heatmap(df2_corr, linewidths=0.5, annot=True)
plt.xticks(rotation=45)
plt.title("Correlation Heatmap", fontsize=15)
plt.show()

Hypothesis Test: (Dependent vs Independent Variables)


  • Null Hypothesis H0: The features (i.e. Temperature, Fuel_Price, CPI, Store Size) are NOT correlated with Store Weekly Sales.
  • Alt. Hypothesis Ha: The features are correlated with Store Weekly Sales.

According to the following hypothesis tests, the features Temperature, CPI and Store Size give Z values of either (Z ≈ -5) < -3 or (Z ≈ 65) > 3, meaning the observed correlation lies more than 3 standard deviations from the mean of the permutation (null) distribution. In other words, there is more than a 99.7% chance that the observed correlation did not arise by chance under the null of no association. We therefore have strong evidence to reject the null hypothesis and conclude that all of the features, except Fuel_Price, are correlated with Store Weekly Sales. The result also aligns with the correlation values in the heatmap above, e.g. the correlation of Fuel_Price vs Weekly_Sales is 0.0095. As mentioned, I would not remove any features during the initial machine learning process even though some of them have low correlation with Weekly_Sales (the dependent variable). In addition, we may want to compare the performance of models built with and without those least correlated features.

In [40]:
#Pearson correlation coefficient between two arrays
def pearson_r(x,y):
    corr_mat = np.corrcoef(x,y)
    return corr_mat[0,1]

#Z-score of `value` relative to the null distribution `array`
#(despite the name, this returns a z-score, not a p-value)
def p_value(array, value):
    return (value - np.mean(array))/np.std(array)
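The permutation loops in the cells below all repeat the same recipe: shuffle the data to break any real association, collect the null correlations, and z-score the observed correlation against them. As a sketch, the procedure could be wrapped once (the helper name `perm_test_z` is my own; it permutes only one array, which suffices under the null, and fixes a seed for reproducibility):

```python
import numpy as np

def perm_test_z(x, y, n_trials=1000, seed=0):
    """Z-score of the observed Pearson correlation against a null
    distribution built by randomly permuting y (breaking any association)."""
    rng = np.random.RandomState(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    null_corrs = np.empty(n_trials)
    for i in range(n_trials):
        # correlation after shuffling y: a draw from the null distribution
        null_corrs[i] = np.corrcoef(x, rng.permutation(y))[0, 1]
    observed = np.corrcoef(x, y)[0, 1]
    return (observed - null_corrs.mean()) / null_corrs.std()
```

For instance, `perm_test_z(temp, ws)` should reproduce the Temperature vs Weekly Sales Z value below, up to permutation noise.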
In [41]:
xi = temp
yi = ws

No_bootstrap_trial = len(ws)  #number of permutation trials
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Temperature vs Weekly Sales is", p_val[0,1])
The Z value for Temperature vs Weekly Sales is -5.12599346608834
In [42]:
xi = fp
yi = ws

No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Fuel_Price vs Weekly Sales is", p_val[0,1])
The Z value for Fuel_Price vs Weekly Sales is 0.7487650321211525
In [43]:
xi = cpi
yi = ws

No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for CPI vs Weekly Sales is", p_val[0,1])
The Z value for CPI vs Weekly Sales is -5.831720613702725
In [44]:
xi = size
yi = ws

No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs Weekly Sales is", p_val[0,1])
The Z value for Store Size vs Weekly Sales is 65.6174496224247

Hypothesis Test: (Store Size vs Other Independent Variables)


  • Null Hypothesis H0: Store Size is NOT correlated with the other features (i.e. Temperature, Fuel_Price, CPI).
  • Alt. Hypothesis Ha: Store Size is correlated with the other features.

According to the following hypothesis tests, Store Size vs Temperature gives (Z ≈ -7.5) < -3, meaning the observed correlation lies more than 3 standard deviations below the mean of the permutation (null) distribution. We therefore have strong evidence to reject the null hypothesis for that pair and conclude that Store Size is correlated with Temperature. On the other hand, Store Size is not correlated with the other TWO features (i.e. CPI, Fuel Price), whose Z values fall within ±3. The result also aligns with the correlation values in the heatmap above.

In [45]:
xi = size
yi = cpi

No_bootstrap_trial = len(size)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs CPI is", p_val[0,1])
The Z value for Store Size vs CPI is -0.7824637189185721
In [46]:
xi = size
yi = fp

No_bootstrap_trial = len(size)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs Fuel Price is", p_val[0,1])
The Z value for Store Size vs Fuel Price is 0.726107704446794
In [47]:
xi = size
yi = temp

No_bootstrap_trial = len(size)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs Temperature is", p_val[0,1])
The Z value for Store Size vs Temperature is -7.463314421009281

Hypothesis Test: (Temperature vs Other Independent Variables)


  • Null Hypothesis H0: Temperature is NOT correlated with the other features (i.e. Fuel_Price, CPI).
  • Alt. Hypothesis Ha: Temperature is correlated with the other features.

According to the following hypothesis tests, Temperature vs Fuel Price gives (Z ≈ 11.7) > 3 and Temperature vs CPI gives (Z ≈ 14.3) > 3, meaning both observed correlations lie more than 3 standard deviations above the mean of the permutation (null) distribution. We therefore have strong evidence to reject the null hypothesis and conclude that Temperature is correlated with both Fuel Price and CPI. The result also aligns with the correlation values in the heatmap above.

In [48]:
xi = fp
yi = temp

No_bootstrap_trial = len(temp)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Temperature vs Fuel Price is", p_val[0,1])
The Z value for Temperature vs Fuel Price is 11.676566479392918
In [49]:
xi = cpi
yi = temp

No_bootstrap_trial = len(temp)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Temperature vs CPI is", p_val[0,1])
The Z value for Temperature vs CPI is 14.266627971110935

Hypothesis Test: (Fuel Price vs CPI)


  • Null Hypothesis H0: Fuel Price is NOT correlated with CPI.
  • Alt. Hypothesis Ha: Fuel Price is correlated with CPI.

According to the following hypothesis test, Fuel Price vs CPI gives (Z ≈ -13.6) < -3, meaning the observed correlation lies more than 3 standard deviations below the mean of the permutation (null) distribution. We therefore have strong evidence to reject the null hypothesis and conclude that Fuel Price is correlated with CPI. The result also aligns with the correlation values in the heatmap above.

In [50]:
xi = cpi
yi = fp

No_bootstrap_trial = len(fp)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Fuel Price vs CPI is", p_val[0,1])
The Z value for Fuel Price vs CPI is -13.646755720403766

Hypothesis Summary


The hypothesis tests above give us some hints about how the independent features relate to each other and how they correlate with the dependent variable. The feature most correlated with Weekly Sales (i.e. the dependent variable) is Store Size; the remaining independent features are comparatively less correlated. This step definitely gives us some insight prior to the machine learning part. Based on this analysis, we can iterate models over more feature combinations in order to find the most efficient and best-performing model.

Findings from Data Analysis


  • NOT all the predictor variables are informative for the analysis: those with a huge number of missing values (i.e. MarkDown1 - MarkDown5) are dropped for the data analysis and further predictive modeling.

  • Most of the instances are retained even when they contain outlier values, because those data points may reflect seasonal patterns in the target variable (i.e. Weekly Sales), e.g. store sales are consistently higher in holiday seasons.

  • Correlations: most of the features, except Fuel_Price, are correlated with the target variable, Store Weekly Sales.

  • Correlations: the predictor variables are mostly correlated with each other, except Store Size vs Fuel Price and Store Size vs CPI.

Modeling and Forecast


  • Utilize the analyzed features above to build predictive models forecasting Weekly Sales (continuous) for a single store or department
  • Grid-search for the best model based on various measurement metrics (e.g. MSE etc.).
  • Proposed methodologies for modeling:
    • Baseline model: Linear Regression
    • Linear models: RidgeCV, LassoCV
    • Regressors: XGB Regressor, SGD Regressor, Bayesian Regressor, Random Forest Regressor, Gradient Boosting Regressor, AdaBoost Regressor
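The model-comparison loop proposed above can be sketched on synthetic data; this is an illustrative outline under my own assumptions (the data, the small candidate subset, and the variable names are not the project's actual pipeline):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

# Illustrative synthetic regression data standing in for the Walmart features
rng = np.random.RandomState(0)
X = rng.rand(500, 4)
y = X @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(0, 0.1, 500)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "LinearRegression": LinearRegression(),  # baseline
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.01),
}
scores = {}
for name, model in candidates.items():
    model.fit(Xtr, ytr)
    pred = model.predict(Xte)
    # record (MSE, MAE) on the held-out split for each candidate
    scores[name] = (mean_squared_error(yte, pred), mean_absolute_error(yte, pred))

best = min(scores, key=lambda k: scores[k][0])  # lowest test MSE wins
```

The same loop extends to the full candidate list by adding entries to the `candidates` dict.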

Machine Learning

In [1]:
import pandas as pd
import numpy as np

trainDf = pd.read_csv("./data/train.csv")
feaDf = pd.read_csv("./data/features.csv").drop(["IsHoliday"], axis=1)
storesDf = pd.read_csv("./data/stores.csv")


df1 = pd.merge(trainDf, feaDf, on = ["Store", "Date"])
#print(df1.shape)
train = pd.merge(df1, storesDf, on = "Store")
print("Dataset size:",train.shape)

#Rename column names
train.columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
       'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
       'MarkDown5', 'CPI', 'Unemployment', 'Store_Type', 'Size']

display(train.head(3))
Dataset size: (421570, 16)
Store Dept Date Weekly_Sales IsHoliday Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment Store_Type Size
0 1 1 2010-02-05 24924.50 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
1 1 2 2010-02-05 50605.27 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
2 1 3 2010-02-05 13740.12 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315

Dummy Categorical Variables

In [2]:
#Copy original dataset 
train2 = train.copy()
train2 = train.drop(["MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5"], axis=1)

###Dummy out categorical variables
train2=pd.concat((train2,pd.get_dummies(train2["IsHoliday"] , prefix="IsHoliday")),axis=1)
train2 = pd.concat((train2,pd.get_dummies(train2["Store_Type"] , prefix="Store_Type")),axis=1)
train2.drop(["IsHoliday","Store_Type"],axis=1, inplace = True)
train2.columns
train2.head(3)
Out[2]:
Store Dept Date Weekly_Sales Temperature Fuel_Price CPI Unemployment Size IsHoliday_False IsHoliday_True Store_Type_A Store_Type_B Store_Type_C
0 1 1 2010-02-05 24924.50 42.31 2.572 211.096358 8.106 151315 1 0 1 0 0
1 1 2 2010-02-05 50605.27 42.31 2.572 211.096358 8.106 151315 1 0 1 0 0
2 1 3 2010-02-05 13740.12 42.31 2.572 211.096358 8.106 151315 1 0 1 0 0
In [ ]:
# load additional module
import pickle

storeLst=[]
for store_no in sorted(list(set(train2.Store))): 
    for date in sorted(list(set(train2.Date))):
        item=train2[(train2.Store==store_no) & (train2.Date==date)]
        storeLst.append(list([store_no,
                      date,
                      round(list(set(item.Temperature))[0],2),
                      round(list(set(item.Fuel_Price))[0],3),
                      round(list(set(item.CPI))[0],5),              
                      round(list(set(item.Unemployment))[0],5),
                      list(set(item.Size))[0],
                      list(set(item.IsHoliday_False))[0],
                      list(set(item.IsHoliday_True))[0],              
                      list(set(item.Store_Type_A))[0], 
                      list(set(item.Store_Type_B))[0], 
                      list(set(item.Store_Type_C))[0], 
                      round(np.divide(item.Weekly_Sales.sum(), 1000000),5)]))
        
with open('train2.data', 'wb') as filehandle:  
    # store the data as binary data stream
    pickle.dump(storeLst, filehandle)        
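The nested loop above runs one filter per (store, date) pair, which is slow on 400k+ rows. A likely much faster alternative, assuming (as the loop does) that the feature columns are constant within each store-week, is a pandas groupby; the function name `aggregate_store_weeks` is my own:

```python
import pandas as pd

def aggregate_store_weeks(train2):
    """Collapse department rows into one row per (Store, Date):
    feature columns take their first value, Weekly_Sales is summed
    and expressed in millions."""
    agg = (train2
           .groupby(["Store", "Date"], as_index=False)
           .agg({"Temperature": "first", "Fuel_Price": "first",
                 "CPI": "first", "Unemployment": "first", "Size": "first",
                 "IsHoliday_False": "first", "IsHoliday_True": "first",
                 "Store_Type_A": "first", "Store_Type_B": "first",
                 "Store_Type_C": "first", "Weekly_Sales": "sum"}))
    agg["Weekly_Sales"] = agg["Weekly_Sales"] / 1_000_000
    return agg
```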

Load aggregated data

In [1]:
import pandas as pd
import numpy as np
import pickle
with open('train2.data', 'rb') as filehandle:  
    # read the data as binary data stream
    storeLst = pickle.load(filehandle)
    
train3=pd.DataFrame(storeLst, columns=["store_no", "date","temp","fuel_price","cpi", "Unemploy","store_size",
                                "holiday_f","holiday_t", "type_A", "type_B", "type_C","wkly_sales" ])
train3.head(3)
Out[1]:
store_no date temp fuel_price cpi Unemploy store_size holiday_f holiday_t type_A type_B type_C wkly_sales
0 1 2010-02-05 42.31 2.572 211.09636 8.106 151315 1 0 1 0 0 1.64369
1 1 2010-02-12 38.51 2.548 211.24217 8.106 151315 0 1 1 0 0 1.64196
2 1 2010-02-19 39.93 2.514 211.28914 8.106 151315 1 0 1 0 0 1.61197

Split the date column into year, month, day columns

In [2]:
new=train3.date.str.split("-",n=2,expand=True)
train3["year"] = new[0]
train3["month"] = new[1]
train3["day"] =new[2]
train3.head()
Out[2]:
store_no date temp fuel_price cpi Unemploy store_size holiday_f holiday_t type_A type_B type_C wkly_sales year month day
0 1 2010-02-05 42.31 2.572 211.09636 8.106 151315 1 0 1 0 0 1.64369 2010 02 05
1 1 2010-02-12 38.51 2.548 211.24217 8.106 151315 0 1 1 0 0 1.64196 2010 02 12
2 1 2010-02-19 39.93 2.514 211.28914 8.106 151315 1 0 1 0 0 1.61197 2010 02 19
3 1 2010-02-26 46.63 2.561 211.31964 8.106 151315 1 0 1 0 0 1.40973 2010 02 26
4 1 2010-03-05 46.50 2.625 211.35014 8.106 151315 1 0 1 0 0 1.55481 2010 03 05
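A possible alternative to string splitting is `pd.to_datetime`, which gives a proper datetime dtype and also makes the later chronological comparison with the cut-off date type-safe; a minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2010-02-05", "2010-02-12"]})
dt = pd.to_datetime(df["date"], format="%Y-%m-%d")
# extract the calendar components from the datetime accessor
df["year"], df["month"], df["day"] = dt.dt.year, dt.dt.month, dt.dt.day
```

Note this yields integer columns (2 rather than the zero-padded string "02" produced by `str.split`).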

Split Train and Test set

The dataset is split into 64% and 36% proportions for training and testing respectively. The cut-off date is set to "2011-11-18": data up to and including the cut-off date is used for training, while data after it is used for testing the trained model.

In [3]:
train4=train3[train3.date <= "2011-11-18"]
test4=train3[train3.date > "2011-11-18"]

Function to convert the dataset into sliding windows (i.e. timesteps)

In [4]:
from sklearn import preprocessing

#data is pandas dataframe
def generate_x_y(data,timestep, normalize=True):
    timestep=timestep
    min_max_scaler = preprocessing.MinMaxScaler()
    ydata=[]
    date_lst=[]
    store_lst=[]
    lst=[]
    for store_no in sorted(list(set(data.store_no))):
       # print(store_no)
        ydata.append(data["wkly_sales"][data.store_no==store_no].iloc[timestep-1:])
        date_lst.append(data["date"][data.store_no==store_no].iloc[timestep-1:])
        store_lst.append(data["store_no"][data.store_no==store_no].iloc[timestep-1:])
        arr =np.array(data[data.store_no==store_no])
        reshape_length = timestep * arr.shape[1]
        for i in range(arr.shape[0]-timestep+1):
            s=arr[i:i+timestep]
            lst.append(s.reshape(reshape_length))


    df=pd.DataFrame(lst , columns=list(data.columns)*timestep)
    xdata=df.drop(["store_no", "date", "wkly_sales"], axis=1)
    ydata=np.array(ydata).flatten()
    date_lst=np.array(date_lst).flatten()
    store_lst=np.array(store_lst).flatten()
    
    # Normalize X data
    if normalize == True:
        normalized_xdata=min_max_scaler.fit_transform(xdata)
        return(normalized_xdata,np.array(ydata), date_lst, store_lst)
    else:
        return(np.array(xdata),np.array(ydata), date_lst, store_lst)

Function to Plot Train vs Test Score

In [5]:
import matplotlib.pyplot as plt
%matplotlib inline

def scorePlot(window_slide_lst, train_score_lst, test_score_lst, modelName, ylabel):
    df= pd.DataFrame(list(zip(window_slide_lst,
                              train_score_lst,
                              test_score_lst
                              )), columns=["timestep","train_score", "test_score"
                                                          ])
    df=df.set_index("timestep")


    df.plot(use_index=True, figsize=(10,5))
    plt.title(modelName)
    plt.ylabel (ylabel)
    plt.xlabel("Timestep / Window_Slide")
    plt.xticks(window_slide_lst)
    plt.show()

Measurement Metrics

Before we go further into the machine learning part, let's establish how we will measure the performance of the models we are going to build. The primary metrics we use are $R^2$, Mean Squared Error (MSE) and Mean Absolute Error (MAE).

  • $R^2$, the coefficient of determination, shows the percentage of variance in the target variable that can be explained by the predictor variables. Regardless of the model's complexity, the score is calculated from the difference between the predicted Y and the true Y. The best possible score is 1; a model that always predicts the mean of the true values scores 0. In our case, we use the .score API from sklearn for this task; further detail is available in the sklearn documentation. The formula is: $$R^2 = 1-\frac{ \sum_{i=1}^N (yTrue_i - yPred_i)^2 }{ \sum_{i=1}^N (yTrue_i - \bar{yTrue})^2}$$

  • $MAE$ expresses the average absolute error of the target values predicted by the model: the higher the MAE, the larger the errors in the predicted Y. The metric has a lower bound of ZERO, reached when there are no errors, and no upper bound. MAE is a steady metric that gives us a view of the typical error magnitude. The formula is: $$MAE = \frac{1}{N}\sum_{i=1}^N |yTrue_i - yPred_i| $$

  • $MSE$ expresses the average squared error of the target values predicted by the model. Again, a higher MSE means larger errors in the predicted Y, and the metric has a lower bound of ZERO and no upper bound. However, because of the square in the formula, MSE penalizes large errors more heavily than MAE, which offers a way to spot large errors in the prediction. The formula is: $$MSE = \frac{1}{N}\sum_{i=1}^N (yTrue_i - yPred_i)^2 $$
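To make the formulas concrete, here is a minimal NumPy sketch (the function names are my own) that mirrors sklearn's `r2_score`, `mean_absolute_error` and `mean_squared_error`:

```python
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
# r2 ≈ 0.9486, mae = 0.5, mse = 0.375
```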

Build model

In [6]:
from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_squared_log_error

Baseline model: Linear Regression

This Linear Regression model is our baseline, built without filtering any features out of the original data. We will then check whether other models offer higher accuracy on the generalized result. The following plot shows the MSE/MAE metrics of the linear model tracing a very stable line from timestep 2 to 6. The MSE values are generally lower than the MAE values, which is expected: the errors are below 1 in absolute value, so squaring them (as MSE does) shrinks them relative to MAE. Also, there are NO unusually large MSE values in the plot, so we can assume there is not much variance in the distribution of errors. According to this result, the model with timestep = 2 is optimal in terms of model complexity versus error. For this reason, we will dive deeper into pruning parameters and features to optimize the accuracy further.

In [7]:
from sklearn.linear_model import LinearRegression

train_reg_score_Lst=[]
test_reg_score_Lst=[]
test_reg_mse_Lst=[]
test_reg_mae_Lst=[]
window_slide_Lst =np.arange(2,7)

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)

    reg = LinearRegression().fit(xtrain, ytrain)
    a=reg.score(xtrain, ytrain)
    b=reg.score(xtest, ytest)
    c=mean_squared_error(ytest,reg.predict(xtest) )
    d=mean_absolute_error(ytest,reg.predict(xtest) )
    
    train_reg_score_Lst.append(a)
    test_reg_score_Lst.append(b)
    test_reg_mse_Lst.append(c)
    test_reg_mae_Lst.append(d)
#pd.DataFrame(list(zip(test_reg_mse_Lst,test_reg_mae_Lst)), columns=["MSE", "MAE"])        
In [8]:
df= pd.DataFrame(list(zip(window_slide_Lst,
                          test_reg_mse_Lst,
                          test_reg_mae_Lst
                          )), columns=["timestep", "MSE", "MAE"])
df=df.set_index("timestep")


df.plot(use_index=True, figsize=(15,4))
plt.title("Linear Regression (Baseline Model)")
plt.ylabel ("MSE / MAE")
plt.xlabel("Timestep")
plt.xticks(np.arange(2,7))
plt.show()
In [ ]:
scorePlot(window_slide_Lst, train_reg_score_Lst, test_reg_score_Lst, "Linear Regression", "R^2 Score")

Recheck the correlation between Predictor and Response Variables

In [10]:
import seaborn as sns
plt.figure(figsize=(10,10))
sns.heatmap(train4.iloc[:, 2:].corr(), linewidths=0.5, annot=True)
plt.show()

Linear Regression (OLS):

Based on the baseline model above, we are going to examine how the response variable (i.e. Weekly Sales) reacts to combined sets of predictor variables. The correlation table above shows that the "store_size", "type_A" and "type_C" variables have the largest positive correlation with the target variable, "wkly_sales". So we start off the modelling with the full set of features and then narrow down to these variables.

In [11]:
# Import regression modules
import statsmodels.api as sm
import statsmodels.formula.api as smf
In [13]:
# statsmodels works nicely with numpy arrays and pandas dataframes
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, 
                                                              timestep=2, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, 
                                                          timestep=2, normalize=True)
m = sm.OLS(ytrain, xtrain).fit()
print(m.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.680
Model:                            OLS   Adj. R-squared:                  0.679
Method:                 Least Squares   F-statistic:                     465.9
Date:                Wed, 23 Jan 2019   Prob (F-statistic):               0.00
Time:                        13:49:56   Log-Likelihood:                -1103.0
No. Observations:                4185   AIC:                             2246.
Df Residuals:                    4165   BIC:                             2373.
Df Model:                          19                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.2018      0.090     -2.240      0.025      -0.379      -0.025
x2            -0.1192      0.166     -0.719      0.472      -0.444       0.206
x3            -1.9600      5.016     -0.391      0.696     -11.793       7.873
x4             2.2702      0.876      2.591      0.010       0.553       3.988
x5             0.7148      0.012     60.881      0.000       0.692       0.738
x6             0.1006      0.028      3.540      0.000       0.045       0.156
x7             0.1179      0.031      3.805      0.000       0.057       0.179
x8             0.0372      0.020      1.899      0.058      -0.001       0.076
x9             0.0553      0.018      2.990      0.003       0.019       0.092
x10            0.1260      0.019      6.486      0.000       0.088       0.164
x11           -2.0714      4.436     -0.467      0.641     -10.767       6.625
x12           -1.9108      4.056     -0.471      0.638      -9.862       6.041
x13           -0.1717      0.365     -0.470      0.638      -0.887       0.544
x14            0.2697      0.090      3.001      0.003       0.093       0.446
x15            0.2876      0.165      1.747      0.081      -0.035       0.610
x16            1.8670      5.027      0.371      0.710      -7.989      11.723
x17           -2.4626      0.875     -2.814      0.005      -4.179      -0.747
x18            0.7148      0.012     60.881      0.000       0.692       0.738
x19            0.0760      0.029      2.634      0.008       0.019       0.133
x20            0.1425      0.030      4.732      0.000       0.083       0.202
x21            0.0372      0.020      1.899      0.058      -0.001       0.076
x22            0.0553      0.018      2.990      0.003       0.019       0.092
x23            0.1260      0.019      6.486      0.000       0.088       0.164
x24            1.9532      4.436      0.440      0.660      -6.743      10.650
x25            1.9962      4.057      0.492      0.623      -5.958       9.951
x26            0.1342      0.364      0.368      0.713      -0.580       0.849
==============================================================================
Omnibus:                     1069.458   Durbin-Watson:                   0.306
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3861.233
Skew:                           1.243   Prob(JB):                         0.00
Kurtosis:                       6.995   Cond. No.                     1.79e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.6e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [14]:
reg = LinearRegression().fit(xtrain, ytrain)
a=reg.score(xtrain, ytrain)
b=reg.score(xtest, ytest)
c=mean_squared_error(ytest,reg.predict(xtest) )
d=mean_absolute_error(ytest,reg.predict(xtest) )

print(a,b,c,d)
0.6795516077917627 0.6722065054368919 0.1058079394388538 0.23998274421296298
In [15]:
# statsmodels works nicely with numpy arrays and pandas dataframes
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                              timestep=2, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                          timestep=2, normalize=True)
m = sm.OLS(ytrain, xtrain).fit()
print(m.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.666
Model:                            OLS   Adj. R-squared:                  0.665
Method:                 Least Squares   F-statistic:                     1665.
Date:                Wed, 23 Jan 2019   Prob (F-statistic):               0.00
Time:                        13:50:07   Log-Likelihood:                -1194.2
No. Observations:                4185   AIC:                             2400.
Df Residuals:                    4179   BIC:                             2438.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.7231      0.012     61.053      0.000       0.700       0.746
x2             0.1378      0.011     12.718      0.000       0.117       0.159
x3             0.1623      0.008     21.145      0.000       0.147       0.177
x4             0.2297      0.010     23.765      0.000       0.211       0.249
x5             1.8227      0.823      2.216      0.027       0.210       3.435
x6             0.7231      0.012     61.053      0.000       0.700       0.746
x7             0.1378      0.011     12.718      0.000       0.117       0.159
x8             0.1623      0.008     21.145      0.000       0.147       0.177
x9             0.2297      0.010     23.765      0.000       0.211       0.249
x10           -1.9036      0.822     -2.316      0.021      -3.515      -0.292
==============================================================================
Omnibus:                     1179.201   Durbin-Watson:                   0.287
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4822.740
Skew:                           1.331   Prob(JB):                         0.00
Kurtosis:                       7.536   Cond. No.                     4.94e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.26e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [16]:
reg = LinearRegression().fit(xtrain, ytrain)
a=reg.score(xtrain, ytrain)
b=reg.score(xtest, ytest)
c=mean_squared_error(ytest,reg.predict(xtest) )
d=mean_absolute_error(ytest,reg.predict(xtest) )

print(a,b,c,d)
0.66538664895962 0.6671660713904171 0.10743493310767493 0.23694491485821756
In [17]:
# statsmodels works nicely with numpy arrays and pandas dataframes
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C" ,"wkly_sales"]], 
                                                              timestep=2, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]], 
                                                          timestep=2, normalize=True)
m = sm.OLS(ytrain, xtrain).fit()
print(m.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.665
Model:                            OLS   Adj. R-squared:                  0.664
Method:                 Least Squares   F-statistic:                     2762.
Date:                Wed, 23 Jan 2019   Prob (F-statistic):               0.00
Time:                        13:50:19   Log-Likelihood:                -1201.4
No. Observations:                4185   AIC:                             2411.
Df Residuals:                    4181   BIC:                             2436.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.7215      0.012     60.896      0.000       0.698       0.745
x2             0.1247      0.010     12.722      0.000       0.105       0.144
x3             0.1482      0.006     25.218      0.000       0.137       0.160
x4             0.2103      0.007     30.759      0.000       0.197       0.224
x5             0.7215      0.012     60.896      0.000       0.698       0.745
x6             0.1247      0.010     12.722      0.000       0.105       0.144
x7             0.1482      0.006     25.218      0.000       0.137       0.160
x8             0.2103      0.007     30.759      0.000       0.197       0.224
==============================================================================
Omnibus:                     1179.386   Durbin-Watson:                   0.288
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4741.471
Skew:                           1.337   Prob(JB):                         0.00
Kurtosis:                       7.477   Cond. No.                     2.41e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.21e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [18]:
reg = LinearRegression().fit(xtrain, ytrain)
a=reg.score(xtrain, ytrain)
b=reg.score(xtest, ytest)
c=mean_squared_error(ytest,reg.predict(xtest) )
d=mean_absolute_error(ytest,reg.predict(xtest) )

print(a,b,c,d)
0.664204645216192 0.6638633783012173 0.10850100414374882 0.23813050079571763
In [19]:
# Fit OLS with statsmodels directly on the windowed numpy arrays
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C" ,"wkly_sales"]], 
                                                              timestep=2, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                          timestep=2, normalize=True)
m = sm.OLS(ytrain, xtrain).fit()
m.summary()
Out[19]:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.913
Model:                            OLS   Adj. R-squared:                  0.913
Method:                 Least Squares   F-statistic:                 1.469e+04
Date:                Wed, 23 Jan 2019   Prob (F-statistic):               0.00
Time:                        13:50:22   Log-Likelihood:                -1497.6
No. Observations:                4185   AIC:                             3001.
Df Residuals:                    4182   BIC:                             3020.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.9378      0.009    106.849      0.000       0.921       0.955
x2            -0.0419      0.008     -5.395      0.000      -0.057      -0.027
x3             0.2037      0.007     27.777      0.000       0.189       0.218
x4             0.9378      0.009    106.849      0.000       0.921       0.955
x5            -0.0419      0.008     -5.395      0.000      -0.057      -0.027
x6             0.2037      0.007     27.777      0.000       0.189       0.218
==============================================================================
Omnibus:                      714.453   Durbin-Watson:                   0.252
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1922.359
Skew:                           0.919   Prob(JB):                         0.00
Kurtosis:                       5.764   Cond. No.                     5.45e+16
==============================================================================


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.3e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [20]:
reg = LinearRegression().fit(xtrain, ytrain)
a=reg.score(xtrain, ytrain)
b=reg.score(xtest, ytest)
c=mean_squared_error(ytest,reg.predict(xtest) )
d=mean_absolute_error(ytest,reg.predict(xtest) )

print(a,b,c,d)
0.6638423438435905 0.6635230790179416 0.10861084880678559 0.2383197902274869
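The identical coefficient pairs in the summaries above (x1 = x5, x2 = x6, ...) and the tiny smallest-eigenvalue warnings are consistent with time-invariant columns (store size, type dummies) being repeated for each lag in the window. A quick rank check on a toy design matrix (an illustration, not the notebook's data) shows why duplicated columns make OLS singular:

```python
import numpy as np

rng = np.random.RandomState(0)
base = rng.randn(100, 2)      # two independent features
X = np.hstack([base, base])   # each feature repeated, like a static column lagged twice

# rank stays at 2 even though X has 4 columns, so X'X is singular
print(np.linalg.matrix_rank(X))
```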

XGB Regressor

In [21]:
window_slide_Lst =np.arange(2,11)
In [22]:
from xgboost import XGBRegressor 

train_xgb_score_Lst1=[]
test_xgb_score_Lst1=[]
test_xgb_mse_Lst1=[]
test_xgb_mae_Lst1=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4,
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, 
                                                              timestep=window_slide, normalize=True)

    xgb = XGBRegressor()
    xgb.fit(xtrain, ytrain)
    a=xgb.score(xtrain, ytrain)
    b=xgb.score(xtest, ytest)
    c=mean_squared_error(ytest,xgb.predict(xtest) )
    d=mean_absolute_error(ytest,xgb.predict(xtest) )    

    train_xgb_score_Lst1.append(a)
    test_xgb_score_Lst1.append(b)
    test_xgb_mse_Lst1.append(c)
    test_xgb_mae_Lst1.append(d)
pd.DataFrame(list(zip(window_slide_Lst,test_xgb_mse_Lst1,test_xgb_mae_Lst1)), columns=["timestep","MSE", "MAE"])    
Out[22]:
   timestep       MSE       MAE
0         2  0.024393  0.111607
1         3  0.023672  0.110676
2         4  0.023875  0.110933
3         5  0.024198  0.112904
4         6  0.022067  0.108845
5         7  0.030632  0.123083
6         8  0.037213  0.132535
7         9  0.036601  0.130795
8        10  0.049572  0.147222
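The lowest-error window can also be read off such a results table programmatically; a minimal sketch using a few of the values above:

```python
import pandas as pd

# subset of the XGB timestep/error table above
resultDF = pd.DataFrame({"timestep": [2, 3, 4, 5, 6, 7],
                         "MSE": [0.024393, 0.023672, 0.023875,
                                 0.024198, 0.022067, 0.030632]})
best_timestep = resultDF.loc[resultDF["MSE"].idxmin(), "timestep"]
print(best_timestep)  # timestep 6 has the smallest MSE in this subset
```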
In [226]:
scorePlot(window_slide_Lst, train_xgb_score_Lst1, test_xgb_score_Lst1, "XGB Regressor", "Score")
In [23]:
from xgboost import XGBRegressor 

train_xgb_score_Lst2=[]
test_xgb_score_Lst2=[]
test_xgb_mse_Lst2=[]
test_xgb_mae_Lst2=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    xgb = XGBRegressor()
    xgb.fit(xtrain, ytrain)
    a=xgb.score(xtrain, ytrain)
    b=xgb.score(xtest, ytest)
    c=mean_squared_error(ytest,xgb.predict(xtest) )
    d=mean_absolute_error(ytest,xgb.predict(xtest) )    

    train_xgb_score_Lst2.append(a)
    test_xgb_score_Lst2.append(b)
    test_xgb_mse_Lst2.append(c)
    test_xgb_mae_Lst2.append(d)
#pd.DataFrame(list(zip(test_xgb_mse_Lst2,test_xgb_mae_Lst2)), columns=["MSE", "MAE"])    
In [24]:
from xgboost import XGBRegressor 

train_xgb_score_Lst3=[]
test_xgb_score_Lst3=[]
test_xgb_mse_Lst3=[]
test_xgb_mae_Lst3=[]


#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no","store_size", "type_A","type_B", "type_C","wkly_sales"]],
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size", "type_A","type_B", "type_C","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    xgb = XGBRegressor()
    xgb.fit(xtrain, ytrain)
    a=xgb.score(xtrain, ytrain)
    b=xgb.score(xtest, ytest)
    c=mean_squared_error(ytest,xgb.predict(xtest) )
    d=mean_absolute_error(ytest,xgb.predict(xtest) )    

    train_xgb_score_Lst3.append(a)
    test_xgb_score_Lst3.append(b)
    test_xgb_mse_Lst3.append(c)
    test_xgb_mae_Lst3.append(d)
#pd.DataFrame(list(zip(test_xgb_mse_Lst3,test_xgb_mae_Lst3)), columns=["MSE", "MAE"])    
In [25]:
from xgboost import XGBRegressor 

train_xgb_score_Lst4=[]
test_xgb_score_Lst4=[]
test_xgb_mse_Lst4=[]
test_xgb_mae_Lst4=[]


#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no","store_size", "type_A", "type_C","wkly_sales"]],
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size", "type_A", "type_C","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    xgb = XGBRegressor()
    xgb.fit(xtrain, ytrain)
    a=xgb.score(xtrain, ytrain)
    b=xgb.score(xtest, ytest)
    c=mean_squared_error(ytest,xgb.predict(xtest) )
    d=mean_absolute_error(ytest,xgb.predict(xtest) )    

    train_xgb_score_Lst4.append(a)
    test_xgb_score_Lst4.append(b)
    test_xgb_mse_Lst4.append(c)
    test_xgb_mae_Lst4.append(d)
#pd.DataFrame(list(zip(test_xgb_mse_Lst4,test_xgb_mae_Lst4)), columns=["MSE", "MAE"])    
In [26]:
plotDF=pd.DataFrame(list(zip(window_slide_Lst,
                             test_xgb_mse_Lst1,test_xgb_mae_Lst1,
                             test_xgb_mse_Lst2,test_xgb_mae_Lst2,
                             test_xgb_mse_Lst3,test_xgb_mae_Lst3,
                             test_xgb_mse_Lst4,test_xgb_mae_Lst4)), columns=["timestep",
                                                                             "MSE-set1", "MAE-set1",
                                                                             "MSE-set2", "MAE-set2",
                                                                             "MSE-set3", "MAE-set3",
                                                                             "MSE-set4", "MAE-set4"])  
plotDF=plotDF.set_index("timestep")
In [27]:
plotDF.plot(use_index=True, figsize=(13,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("XGB Error")
plt.ylabel("MSE / MAE")
plt.show()
In [209]:
scorePlot(window_slide_Lst, train_xgb_score_Lst1, test_xgb_score_Lst1, "XGB Regressor")

RidgeCV

In [28]:
from sklearn.linear_model import RidgeCV

train_ridge_score_Lst1=[]
test_ridge_score_Lst1=[]
test_ridge_mse_Lst1=[]
test_ridge_mae_Lst1=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)

    ridge = RidgeCV(cv=5, alphas=(0.001,0.01,0.1, 1.0, 10.0))
    ridge.fit(xtrain, ytrain)
    a=ridge.score(xtrain, ytrain)
    b=ridge.score(xtest, ytest)
    c=mean_squared_error(ytest,ridge.predict(xtest) )
    d=mean_absolute_error(ytest,ridge.predict(xtest) )

    
    train_ridge_score_Lst1.append(a)
    test_ridge_score_Lst1.append(b)
    test_ridge_mse_Lst1.append(c)
    test_ridge_mae_Lst1.append(d)
#pd.DataFrame(list(zip(test_ridge_mse_Lst1,test_ridge_mae_Lst1)), columns=["MSE", "MAE"])        
In [29]:
train_ridge_score_Lst2=[]
test_ridge_score_Lst2=[]
test_ridge_mse_Lst2=[]
test_ridge_mae_Lst2=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    ridge = RidgeCV(cv=5, alphas=(0.001,0.01,0.1, 1.0, 10.0))
    ridge.fit(xtrain, ytrain)
    a=ridge.score(xtrain, ytrain)
    b=ridge.score(xtest, ytest)
    c=mean_squared_error(ytest,ridge.predict(xtest) )
    d=mean_absolute_error(ytest,ridge.predict(xtest) )
    
    train_ridge_score_Lst2.append(a)
    test_ridge_score_Lst2.append(b)
    test_ridge_mse_Lst2.append(c)
    test_ridge_mae_Lst2.append(d)    
In [30]:
train_ridge_score_Lst3=[]
test_ridge_score_Lst3=[]
test_ridge_mse_Lst3=[]
test_ridge_mae_Lst3=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    ridge = RidgeCV(cv=5, alphas=(0.001,0.01,0.1, 1.0, 10.0))
    ridge.fit(xtrain, ytrain)
    a=ridge.score(xtrain, ytrain)
    b=ridge.score(xtest, ytest)
    c=mean_squared_error(ytest,ridge.predict(xtest) )
    d=mean_absolute_error(ytest,ridge.predict(xtest) )
    
    train_ridge_score_Lst3.append(a)
    test_ridge_score_Lst3.append(b)
    test_ridge_mse_Lst3.append(c)
    test_ridge_mae_Lst3.append(d)    
In [31]:
train_ridge_score_Lst4=[]
test_ridge_score_Lst4=[]
test_ridge_mse_Lst4=[]
test_ridge_mae_Lst4=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    ridge = RidgeCV(cv=5, alphas=(0.001,0.01,0.1, 1.0, 10.0))
    ridge.fit(xtrain, ytrain)
    a=ridge.score(xtrain, ytrain)
    b=ridge.score(xtest, ytest)
    c=mean_squared_error(ytest,ridge.predict(xtest) )
    d=mean_absolute_error(ytest,ridge.predict(xtest) )
    
    train_ridge_score_Lst4.append(a)
    test_ridge_score_Lst4.append(b)
    test_ridge_mse_Lst4.append(c)
    test_ridge_mae_Lst4.append(d)    
In [32]:
plotDF=pd.DataFrame(list(zip(window_slide_Lst,
                             test_ridge_mse_Lst1,test_ridge_mae_Lst1,
                             test_ridge_mse_Lst2,test_ridge_mae_Lst2,
                             test_ridge_mse_Lst3,test_ridge_mae_Lst3,
                             test_ridge_mse_Lst4,test_ridge_mae_Lst4)), columns=["timestep",
                                                                             "MSE-set1", "MAE-set1",
                                                                             "MSE-set2", "MAE-set2",
                                                                             "MSE-set3", "MAE-set3",
                                                                             "MSE-set4", "MAE-set4"])  
plotDF=plotDF.set_index("timestep")
In [33]:
plotDF.plot(use_index=True, figsize=(13,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("Ridge Error")
plt.ylabel("MSE / MAE")
plt.show()
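RidgeCV picks the regularization strength by cross-validation, and the chosen value is exposed on the fitted model as `alpha_`. A self-contained sketch on synthetic data (the loops above discard the fitted models, so this is illustrative only):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = X @ np.array([1.0, -0.5, 0.3, 2.0]) + rng.randn(200) * 0.1

# same alpha grid as the notebook's RidgeCV cells
ridge = RidgeCV(cv=5, alphas=(0.001, 0.01, 0.1, 1.0, 10.0))
ridge.fit(X, y)
print(ridge.alpha_)  # the alpha that minimized cross-validation error
```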
In [ ]:
scorePlot(window_slide_Lst, train_ridge_score_Lst1, test_ridge_score_Lst1, "RidgeCV")

LassoCV

In [79]:
from sklearn.linear_model import LassoCV

train_lasso_score_Lst1=[]
test_lasso_score_Lst1=[]
test_lasso_mse_Lst1=[]
test_lasso_mae_Lst1=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)

    lasso = LassoCV(cv=5, alphas=(0.001,0.01, 0.1,1.0))
    lasso.fit(xtrain, ytrain)
    a=lasso.score(xtrain, ytrain)
    b=lasso.score(xtest, ytest)
    c=mean_squared_error(ytest,lasso.predict(xtest) )
    d=mean_absolute_error(ytest,lasso.predict(xtest) )

    
    train_lasso_score_Lst1.append(a)
    test_lasso_score_Lst1.append(b)
    test_lasso_mse_Lst1.append(c)
    test_lasso_mae_Lst1.append(d)
#pd.DataFrame(list(zip(test_lasso_mse_Lst1,test_lasso_mae_Lst1)), columns=["MSE", "MAE"])        
/Users/KevQuant/anaconda/lib/python3.6/site-packages/sklearn/linear_model/coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)
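The ConvergenceWarning above comes from coordinate descent running out of iterations at the smallest alphas. Raising `max_iter` (default 1000) usually resolves it, at some cost in fit time; a sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X[:, 0] * 2.0 + rng.randn(200) * 0.1

# same alpha grid as above, but with a much larger iteration budget
lasso = LassoCV(cv=5, alphas=(0.001, 0.01, 0.1, 1.0), max_iter=50000)
lasso.fit(X, y)
print(lasso.alpha_)
```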
In [80]:
train_lasso_score_Lst2=[]
test_lasso_score_Lst2=[]
test_lasso_mse_Lst2=[]
test_lasso_mae_Lst2=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    lasso = LassoCV(cv=5, alphas=(0.001,0.01, 0.1,1.0))
    lasso.fit(xtrain, ytrain)
    a=lasso.score(xtrain, ytrain)
    b=lasso.score(xtest, ytest)
    c=mean_squared_error(ytest,lasso.predict(xtest) )
    d=mean_absolute_error(ytest,lasso.predict(xtest) )
    
    train_lasso_score_Lst2.append(a)
    test_lasso_score_Lst2.append(b)
    test_lasso_mse_Lst2.append(c)
    test_lasso_mae_Lst2.append(d)    
In [81]:
train_lasso_score_Lst3=[]
test_lasso_score_Lst3=[]
test_lasso_mse_Lst3=[]
test_lasso_mae_Lst3=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    lasso = LassoCV(cv=5, alphas=(0.001,0.01, 0.1,1.0))
    lasso.fit(xtrain, ytrain)
    a=lasso.score(xtrain, ytrain)
    b=lasso.score(xtest, ytest)
    c=mean_squared_error(ytest,lasso.predict(xtest) )
    d=mean_absolute_error(ytest,lasso.predict(xtest) )
    
    train_lasso_score_Lst3.append(a)
    test_lasso_score_Lst3.append(b)
    test_lasso_mse_Lst3.append(c)
    test_lasso_mae_Lst3.append(d)    
In [82]:
train_lasso_score_Lst4=[]
test_lasso_score_Lst4=[]
test_lasso_mse_Lst4=[]
test_lasso_mae_Lst4=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    lasso = LassoCV(cv=5, alphas=(0.001,0.01, 0.1,1.0))
    lasso.fit(xtrain, ytrain)
    a=lasso.score(xtrain, ytrain)
    b=lasso.score(xtest, ytest)
    c=mean_squared_error(ytest,lasso.predict(xtest) )
    d=mean_absolute_error(ytest,lasso.predict(xtest) )
    
    train_lasso_score_Lst4.append(a)
    test_lasso_score_Lst4.append(b)
    test_lasso_mse_Lst4.append(c)
    test_lasso_mae_Lst4.append(d)    
In [85]:
plotDF=pd.DataFrame(list(zip(window_slide_Lst,
                             test_lasso_mse_Lst1,test_lasso_mae_Lst1,
                             test_lasso_mse_Lst2,test_lasso_mae_Lst2,
                             test_lasso_mse_Lst3,test_lasso_mae_Lst3,
                             test_lasso_mse_Lst4,test_lasso_mae_Lst4)), columns=["timestep",
                                                                             "MSE-set1", "MAE-set1",
                                                                             "MSE-set2", "MAE-set2",
                                                                             "MSE-set3", "MAE-set3",
                                                                             "MSE-set4", "MAE-set4"])  
plotDF=plotDF.set_index("timestep")
In [86]:
plotDF.plot(use_index=True, figsize=(13,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("Lasso Error")
plt.ylabel("MSE / MAE")
plt.show()

SGD

In [46]:
from sklearn import linear_model

train_sgd_score_Lst1=[]
test_sgd_score_Lst1=[]
test_sgd_mse_Lst1=[]
test_sgd_mae_Lst1=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, 
                                                              timestep=window_slide, normalize=True)

    sgd = linear_model.SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.001,max_iter=10000)
    sgd.fit(xtrain,ytrain)
    a=sgd.score(xtrain, ytrain)
    b=sgd.score(xtest, ytest)
    c=mean_squared_error(ytest,sgd.predict(xtest) )
    d=mean_absolute_error(ytest,sgd.predict(xtest) )

    
    train_sgd_score_Lst1.append(a)
    test_sgd_score_Lst1.append(b)
    test_sgd_mse_Lst1.append(c)
    test_sgd_mae_Lst1.append(d)
In [ ]:
scorePlot(window_slide_Lst, train_sgd_score_Lst1, test_sgd_score_Lst1, "SGD Regressor")
In [47]:
from sklearn import linear_model

train_sgd_score_Lst2=[]
test_sgd_score_Lst2=[]
test_sgd_mse_Lst2=[]
test_sgd_mae_Lst2=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    sgd = linear_model.SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.001,max_iter=10000)
    sgd.fit(xtrain,ytrain)
    a=sgd.score(xtrain, ytrain)
    b=sgd.score(xtest, ytest)
    c=mean_squared_error(ytest,sgd.predict(xtest) )
    d=mean_absolute_error(ytest,sgd.predict(xtest) )

    
    train_sgd_score_Lst2.append(a)
    test_sgd_score_Lst2.append(b)
    test_sgd_mse_Lst2.append(c)
    test_sgd_mae_Lst2.append(d)
In [48]:
from sklearn import linear_model

train_sgd_score_Lst3=[]
test_sgd_score_Lst3=[]
test_sgd_mse_Lst3=[]
test_sgd_mae_Lst3=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    sgd = linear_model.SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.001,max_iter=10000)
    sgd.fit(xtrain,ytrain)
    a=sgd.score(xtrain, ytrain)
    b=sgd.score(xtest, ytest)
    c=mean_squared_error(ytest,sgd.predict(xtest) )
    d=mean_absolute_error(ytest,sgd.predict(xtest) )

    
    train_sgd_score_Lst3.append(a)
    test_sgd_score_Lst3.append(b)
    test_sgd_mse_Lst3.append(c)
    test_sgd_mae_Lst3.append(d)
In [74]:
from sklearn import linear_model

train_sgd_score_Lst4=[]
test_sgd_score_Lst4=[]
test_sgd_mse_Lst4=[]
test_sgd_mae_Lst4=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    sgd = linear_model.SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.001,max_iter=10000)
    sgd.fit(xtrain,ytrain)
    a=sgd.score(xtrain, ytrain)
    b=sgd.score(xtest, ytest)
    c=mean_squared_error(ytest,sgd.predict(xtest) )
    d=mean_absolute_error(ytest,sgd.predict(xtest) )

    
    train_sgd_score_Lst4.append(a)
    test_sgd_score_Lst4.append(b)
    test_sgd_mse_Lst4.append(c)
    test_sgd_mae_Lst4.append(d)
In [75]:
plotDF=pd.DataFrame(list(zip(window_slide_Lst,
                             test_sgd_mse_Lst1,test_sgd_mae_Lst1,
                             test_sgd_mse_Lst2,test_sgd_mae_Lst2,
                             test_sgd_mse_Lst3,test_sgd_mae_Lst3,
                             test_sgd_mse_Lst4,test_sgd_mae_Lst4)), columns=["timestep",
                                                                             "MSE-set1", "MAE-set1",
                                                                             "MSE-set2", "MAE-set2",
                                                                             "MSE-set3", "MAE-set3",
                                                                             "MSE-set4", "MAE-set4"])  
plotDF=plotDF.set_index("timestep")
In [76]:
plotDF.plot(use_index=True, figsize=(13,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("sgd Error")
plt.ylabel("MSE / MAE")
plt.show()
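SGDRegressor is sensitive to feature scale; here `generate_x_y(..., normalize=True)` handles that, but with raw features it is safer to standardize inside a pipeline so the gradient steps stay well-behaved. A sketch on synthetic data (note that recent scikit-learn versions spell the loss `'squared_error'` rather than `'squared_loss'`, so the default loss is used here):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.randn(200, 3) * np.array([1.0, 100.0, 0.01])   # wildly different scales
y = X @ np.array([2.0, 0.01, 50.0]) + rng.randn(200) * 0.1

# scaling inside the pipeline is applied consistently at fit and predict time
model = make_pipeline(StandardScaler(),
                      SGDRegressor(penalty='l2', alpha=0.001, max_iter=10000))
model.fit(X[:150], y[:150])
score = model.score(X[150:], y[150:])
```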

Bayesian Regressor

In [49]:
from sklearn import linear_model

train_bay_score_Lst1=[]
test_bay_score_Lst1=[]
test_bay_mse_Lst1=[]
test_bay_mae_Lst1=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst1, train_store_Lst1 = generate_x_y(train4, 
                                                                    timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst1, test_store_Lst1 = generate_x_y(test4, 
                                                                timestep=window_slide, normalize=True)

    bay = linear_model.BayesianRidge()
    bay.fit(xtrain, ytrain)
    a=bay.score(xtrain, ytrain)
    b=bay.score(xtest, ytest)
    c=mean_squared_error(ytest,bay.predict(xtest) )
    d=mean_absolute_error(ytest,bay.predict(xtest) )

    
    train_bay_score_Lst1.append(a)
    test_bay_score_Lst1.append(b)
    test_bay_mse_Lst1.append(c)
    test_bay_mae_Lst1.append(d)    
    
#pd.DataFrame(list(zip(test_bay_mse_Lst1,test_bay_mae_Lst1)), columns=["MSE", "MAE"])                    
In [50]:
from sklearn import linear_model

train_bay_score_Lst2=[]
test_bay_score_Lst2=[]
test_bay_mse_Lst2=[]
test_bay_mae_Lst2=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst2, train_store_Lst2 = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst2, test_store_Lst2 = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    bay = linear_model.BayesianRidge()
    bay.fit(xtrain, ytrain)
    a=bay.score(xtrain, ytrain)
    b=bay.score(xtest, ytest)
    c=mean_squared_error(ytest,bay.predict(xtest) )
    d=mean_absolute_error(ytest,bay.predict(xtest) )

    
    train_bay_score_Lst2.append(a)
    test_bay_score_Lst2.append(b)
    test_bay_mse_Lst2.append(c)
    test_bay_mae_Lst2.append(d)    
    
#pd.DataFrame(list(zip(test_bay_mse_Lst2,test_bay_mae_Lst2)), columns=["MSE", "MAE"])                    
In [51]:
from sklearn import linear_model

train_bay_score_Lst3=[]
test_bay_score_Lst3=[]
test_bay_mse_Lst3=[]
test_bay_mae_Lst3=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst3, train_store_Lst3 = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst3, test_store_Lst3 = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    bay = linear_model.BayesianRidge()
    bay.fit(xtrain, ytrain)
    a=bay.score(xtrain, ytrain)
    b=bay.score(xtest, ytest)
    c=mean_squared_error(ytest,bay.predict(xtest) )
    d=mean_absolute_error(ytest,bay.predict(xtest) )

    
    train_bay_score_Lst3.append(a)
    test_bay_score_Lst3.append(b)
    test_bay_mse_Lst3.append(c)
    test_bay_mae_Lst3.append(d)    
    
#pd.DataFrame(list(zip(test_bay_mse_Lst3,test_bay_mae_Lst3)), columns=["MSE", "MAE"])                    
In [52]:
from sklearn import linear_model

train_bay_score_Lst4=[]
test_bay_score_Lst4=[]
test_bay_mse_Lst4=[]
test_bay_mae_Lst4=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst4, train_store_Lst4 = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst4, test_store_Lst4 = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    bay = linear_model.BayesianRidge()
    bay.fit(xtrain, ytrain)
    a=bay.score(xtrain, ytrain)
    b=bay.score(xtest, ytest)
    c=mean_squared_error(ytest,bay.predict(xtest) )
    d=mean_absolute_error(ytest,bay.predict(xtest) )

    
    train_bay_score_Lst4.append(a)
    test_bay_score_Lst4.append(b)
    test_bay_mse_Lst4.append(c)
    test_bay_mae_Lst4.append(d)    
    
#pd.DataFrame(list(zip(test_bay_mse_Lst4,test_bay_mae_Lst4)), columns=["MSE", "MAE"])                    
In [53]:
plotDF=pd.DataFrame(list(zip(window_slide_Lst,
                             test_bay_mse_Lst1,test_bay_mae_Lst1,
                             test_bay_mse_Lst2,test_bay_mae_Lst2,
                             test_bay_mse_Lst3,test_bay_mae_Lst3,
                             test_bay_mse_Lst4,test_bay_mae_Lst4)), columns=["timestep",
                                                                             "MSE-set1", "MAE-set1",
                                                                             "MSE-set2", "MAE-set2",
                                                                             "MSE-set3", "MAE-set3",
                                                                             "MSE-set4", "MAE-set4"])  
plotDF=plotDF.set_index("timestep")
In [54]:
plotDF.plot(use_index=True, figsize=(13,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("Bayesian Error")
plt.ylabel("MSE / MAE")
plt.show()
In [60]:
scorePlot(window_slide_Lst, train_bay_score_Lst, test_bay_score_Lst, "Bayesian Regressor")

Random Forest Regressor

In [92]:
from sklearn.ensemble import RandomForestRegressor

train_rfr_score_Lst1=[]
test_rfr_score_Lst1=[]
test_rfr_mse_Lst1=[]
test_rfr_mae_Lst1=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst1, train_store_Lst1 = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst1, test_store_Lst1 = generate_x_y(test4, timestep=window_slide, normalize=True)

    
    rfr= RandomForestRegressor(n_estimators=100)
    rfr.fit(xtrain, ytrain)
    a=rfr.score(xtrain, ytrain)
    b=rfr.score(xtest, ytest)
    c=mean_squared_error(ytest,rfr.predict(xtest) )
    d=mean_absolute_error(ytest,rfr.predict(xtest) )    
    
    train_rfr_score_Lst1.append(a)
    test_rfr_score_Lst1.append(b)
    test_rfr_mse_Lst1.append(c)
    test_rfr_mae_Lst1.append(d)    
    
#pd.DataFrame(list(zip(test_rfr_mse_Lst1,test_rfr_mae_Lst1)), columns=["MSE", "MAE"])                        
    
In [93]:
from sklearn.ensemble import RandomForestRegressor

train_rfr_score_Lst2=[]
test_rfr_score_Lst2=[]
test_rfr_mse_Lst2=[]
test_rfr_mae_Lst2=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst2, train_store_Lst2 = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                                    timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst2, test_store_Lst2 = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                                timestep=window_slide, normalize=True)

    
    rfr= RandomForestRegressor(n_estimators=100)
    rfr.fit(xtrain, ytrain)
    a=rfr.score(xtrain, ytrain)
    b=rfr.score(xtest, ytest)
    c=mean_squared_error(ytest,rfr.predict(xtest) )
    d=mean_absolute_error(ytest,rfr.predict(xtest) )    
    
    train_rfr_score_Lst2.append(a)
    test_rfr_score_Lst2.append(b)
    test_rfr_mse_Lst2.append(c)
    test_rfr_mae_Lst2.append(d)    
    
#pd.DataFrame(list(zip(test_rfr_mse_Lst2,test_rfr_mae_Lst2)), columns=["MSE", "MAE"])                        
    
In [94]:
from sklearn.ensemble import RandomForestRegressor

train_rfr_score_Lst3=[]
test_rfr_score_Lst3=[]
test_rfr_mse_Lst3=[]
test_rfr_mae_Lst3=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst3, train_store_Lst3 = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]], 
                                                                    timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst3, test_store_Lst3 = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]], 
                                                                timestep=window_slide, normalize=True)

    
    rfr= RandomForestRegressor(n_estimators=100)
    rfr.fit(xtrain, ytrain)
    a=rfr.score(xtrain, ytrain)
    b=rfr.score(xtest, ytest)
    c=mean_squared_error(ytest,rfr.predict(xtest) )
    d=mean_absolute_error(ytest,rfr.predict(xtest) )    
    
    train_rfr_score_Lst3.append(a)
    test_rfr_score_Lst3.append(b)
    test_rfr_mse_Lst3.append(c)
    test_rfr_mae_Lst3.append(d)    
    
#pd.DataFrame(list(zip(test_rfr_mse_Lst3,test_rfr_mae_Lst3)), columns=["MSE", "MAE"])                        
    
In [95]:
from sklearn.ensemble import RandomForestRegressor

train_rfr_score_Lst4=[]
test_rfr_score_Lst4=[]
test_rfr_mse_Lst4=[]
test_rfr_mae_Lst4=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst4, train_store_Lst4 = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                                    timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst4, test_store_Lst4 = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                                timestep=window_slide, normalize=True)

    
    rfr= RandomForestRegressor(n_estimators=100)
    rfr.fit(xtrain, ytrain)
    a=rfr.score(xtrain, ytrain)
    b=rfr.score(xtest, ytest)
    c=mean_squared_error(ytest,rfr.predict(xtest) )
    d=mean_absolute_error(ytest,rfr.predict(xtest) )    
    
    train_rfr_score_Lst4.append(a)
    test_rfr_score_Lst4.append(b)
    test_rfr_mse_Lst4.append(c)
    test_rfr_mae_Lst4.append(d)    
    
#pd.DataFrame(list(zip(test_rfr_mse_Lst4,test_rfr_mae_Lst4)), columns=["MSE", "MAE"])                        
    
In [96]:
plotDF=pd.DataFrame(list(zip(window_slide_Lst,
                             test_rfr_mse_Lst1,test_rfr_mae_Lst1,
                             test_rfr_mse_Lst2,test_rfr_mae_Lst2,
                             test_rfr_mse_Lst3,test_rfr_mae_Lst3,
                             test_rfr_mse_Lst4,test_rfr_mae_Lst4)), columns=["timestep",
                                                                             "MSE-set1", "MAE-set1",
                                                                             "MSE-set2", "MAE-set2",
                                                                             "MSE-set3", "MAE-set3",
                                                                             "MSE-set4", "MAE-set4"])  
plotDF=plotDF.set_index("timestep")
In [97]:
plotDF.plot(use_index=True, figsize=(13,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("Random Forest Regressor Error")
plt.ylabel("MSE / MAE")
plt.show()
In [62]:
scorePlot(window_slide_Lst, train_rfr_score_Lst, test_rfr_score_Lst, "Random Forest Regressor")

Gradient Boosting Regressor

In [61]:
from sklearn.ensemble import GradientBoostingRegressor

train_gbr_score_Lst1=[]
test_gbr_score_Lst1=[]
test_gbr_mse_Lst1=[]
test_gbr_mae_Lst1=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst1, train_store_Lst1 = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst1, test_store_Lst1 = generate_x_y(test4, timestep=window_slide, normalize=True)

    gbr= GradientBoostingRegressor(n_estimators=200)
    gbr.fit(xtrain, ytrain)
    a=gbr.score(xtrain, ytrain)
    b=gbr.score(xtest, ytest)
    c=mean_squared_error(ytest,gbr.predict(xtest) )
    d=mean_absolute_error(ytest,gbr.predict(xtest) )    

    train_gbr_score_Lst1.append(a)
    test_gbr_score_Lst1.append(b)
    test_gbr_mse_Lst1.append(c)
    test_gbr_mae_Lst1.append(d)    
    
#pd.DataFrame(list(zip(test_gbr_mse_Lst1,test_gbr_mae_Lst1)), columns=["MSE", "MAE"])                         
In [62]:
from sklearn.ensemble import GradientBoostingRegressor

train_gbr_score_Lst2=[]
test_gbr_score_Lst2=[]
test_gbr_mse_Lst2=[]
test_gbr_mae_Lst2=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst2, train_store_Lst2 = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst2, test_store_Lst2 = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    gbr= GradientBoostingRegressor(n_estimators=200)
    gbr.fit(xtrain, ytrain)
    a=gbr.score(xtrain, ytrain)
    b=gbr.score(xtest, ytest)
    c=mean_squared_error(ytest,gbr.predict(xtest) )
    d=mean_absolute_error(ytest,gbr.predict(xtest) )    

    train_gbr_score_Lst2.append(a)
    test_gbr_score_Lst2.append(b)
    test_gbr_mse_Lst2.append(c)
    test_gbr_mae_Lst2.append(d)    
    
#pd.DataFrame(list(zip(test_gbr_mse_Lst2,test_gbr_mae_Lst2)), columns=["MSE", "MAE"])                         
In [63]:
from sklearn.ensemble import GradientBoostingRegressor

train_gbr_score_Lst3=[]
test_gbr_score_Lst3=[]
test_gbr_mse_Lst3=[]
test_gbr_mae_Lst3=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst3, train_store_Lst3 = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst3, test_store_Lst3 = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    gbr= GradientBoostingRegressor(n_estimators=200)
    gbr.fit(xtrain, ytrain)
    a=gbr.score(xtrain, ytrain)
    b=gbr.score(xtest, ytest)
    c=mean_squared_error(ytest,gbr.predict(xtest) )
    d=mean_absolute_error(ytest,gbr.predict(xtest) )    

    train_gbr_score_Lst3.append(a)
    test_gbr_score_Lst3.append(b)
    test_gbr_mse_Lst3.append(c)
    test_gbr_mae_Lst3.append(d)    
    
#pd.DataFrame(list(zip(test_gbr_mse_Lst3,test_gbr_mae_Lst3)), columns=["MSE", "MAE"])                         
In [64]:
from sklearn.ensemble import GradientBoostingRegressor

train_gbr_score_Lst4=[]
test_gbr_score_Lst4=[]
test_gbr_mse_Lst4=[]
test_gbr_mae_Lst4=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst4, train_store_Lst4 = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst4, test_store_Lst4 = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)

    gbr= GradientBoostingRegressor(n_estimators=200)
    gbr.fit(xtrain, ytrain)
    a=gbr.score(xtrain, ytrain)
    b=gbr.score(xtest, ytest)
    c=mean_squared_error(ytest,gbr.predict(xtest) )
    d=mean_absolute_error(ytest,gbr.predict(xtest) )    

    ypred=gbr.predict(xtest)
    train_gbr_score_Lst4.append(a)
    test_gbr_score_Lst4.append(b)
    test_gbr_mse_Lst4.append(c)
    test_gbr_mae_Lst4.append(d)    
    
#pd.DataFrame(list(zip(test_gbr_mse_Lst4,test_gbr_mae_Lst4)), columns=["MSE", "MAE"])                         
In [65]:
plotDF=pd.DataFrame(list(zip(window_slide_Lst,
                             test_gbr_mse_Lst1,test_gbr_mae_Lst1,
                             test_gbr_mse_Lst2,test_gbr_mae_Lst2,
                             test_gbr_mse_Lst3,test_gbr_mae_Lst3,
                             test_gbr_mse_Lst4,test_gbr_mae_Lst4)), columns=["timestep",
                                                                             "MSE-set1", "MAE-set1",
                                                                             "MSE-set2", "MAE-set2",
                                                                             "MSE-set3", "MAE-set3",
                                                                             "MSE-set4", "MAE-set4"])  
plotDF=plotDF.set_index("timestep")
In [66]:
plotDF.plot(use_index=True, figsize=(13,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("Gradient Boost Regressor Error")
plt.ylabel("MSE / MAE")
plt.show()
In [64]:
scorePlot(window_slide_Lst, train_gbr_score_Lst, test_gbr_score_Lst, "Gradient Boosting Regressor")

Adaboost Regressor

In [67]:
from sklearn.ensemble import AdaBoostRegressor

train_abr_score_Lst1=[]
test_abr_score_Lst1=[]
test_abr_mse_Lst1=[]
test_abr_mae_Lst1=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst1, train_store_Lst1 = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst1, test_store_Lst1 = generate_x_y(test4, timestep=window_slide, normalize=True)


    abr=AdaBoostRegressor(n_estimators=400, learning_rate=0.001)
    abr.fit(xtrain, ytrain)
    a=abr.score(xtrain, ytrain)
    b=abr.score(xtest, ytest)
    c=mean_squared_error(ytest,abr.predict(xtest) )
    d=mean_absolute_error(ytest,abr.predict(xtest) )    
    
    train_abr_score_Lst1.append(a)
    test_abr_score_Lst1.append(b)
    test_abr_mse_Lst1.append(c)
    test_abr_mae_Lst1.append(d)    
    
#pd.DataFrame(list(zip(test_abr_mse_Lst1,test_abr_mae_Lst1)), columns=["MSE", "MAE"])                             
In [68]:
from sklearn.ensemble import AdaBoostRegressor

train_abr_score_Lst2=[]
test_abr_score_Lst2=[]
test_abr_mse_Lst2=[]
test_abr_mae_Lst2=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst2, train_store_Lst2 = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst2, test_store_Lst2 = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C",  "Unemploy","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)


    abr=AdaBoostRegressor(n_estimators=400, learning_rate=0.001)
    abr.fit(xtrain, ytrain)
    a=abr.score(xtrain, ytrain)
    b=abr.score(xtest, ytest)
    c=mean_squared_error(ytest,abr.predict(xtest) )
    d=mean_absolute_error(ytest,abr.predict(xtest) )    
    
    train_abr_score_Lst2.append(a)
    test_abr_score_Lst2.append(b)
    test_abr_mse_Lst2.append(c)
    test_abr_mae_Lst2.append(d)    
    
#pd.DataFrame(list(zip(test_abr_mse_Lst2,test_abr_mae_Lst2)), columns=["MSE", "MAE"])                             
In [69]:
from sklearn.ensemble import AdaBoostRegressor

train_abr_score_Lst3=[]
test_abr_score_Lst3=[]
test_abr_mse_Lst3=[]
test_abr_mae_Lst3=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst3, train_store_Lst3 = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst3, test_store_Lst3 = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)


    abr=AdaBoostRegressor(n_estimators=400, learning_rate=0.001)
    abr.fit(xtrain, ytrain)
    a=abr.score(xtrain, ytrain)
    b=abr.score(xtest, ytest)
    c=mean_squared_error(ytest,abr.predict(xtest) )
    d=mean_absolute_error(ytest,abr.predict(xtest) )    
    
    train_abr_score_Lst3.append(a)
    test_abr_score_Lst3.append(b)
    test_abr_mse_Lst3.append(c)
    test_abr_mae_Lst3.append(d)    
    
#pd.DataFrame(list(zip(test_abr_mse_Lst3,test_abr_mae_Lst3)), columns=["MSE", "MAE"])                             
In [70]:
from sklearn.ensemble import AdaBoostRegressor

train_abr_score_Lst4=[]
test_abr_score_Lst4=[]
test_abr_mse_Lst4=[]
test_abr_mae_Lst4=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst4, train_store_Lst4 = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                                  timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst4, test_store_Lst4 = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                              timestep=window_slide, normalize=True)


    abr=AdaBoostRegressor(n_estimators=400, learning_rate=0.001)
    abr.fit(xtrain, ytrain)
    a=abr.score(xtrain, ytrain)
    b=abr.score(xtest, ytest)
    c=mean_squared_error(ytest,abr.predict(xtest) )
    d=mean_absolute_error(ytest,abr.predict(xtest) )    
    
    train_abr_score_Lst4.append(a)
    test_abr_score_Lst4.append(b)
    test_abr_mse_Lst4.append(c)
    test_abr_mae_Lst4.append(d)    
    
#pd.DataFrame(list(zip(test_abr_mse_Lst4,test_abr_mae_Lst4)), columns=["MSE", "MAE"])                             
In [71]:
plotDF=pd.DataFrame(list(zip(window_slide_Lst,
                             test_abr_mse_Lst1,test_abr_mae_Lst1,
                             test_abr_mse_Lst2,test_abr_mae_Lst2,
                             test_abr_mse_Lst3,test_abr_mae_Lst3,
                             test_abr_mse_Lst4,test_abr_mae_Lst4)), columns=["timestep",
                                                                             "MSE-set1", "MAE-set1",
                                                                             "MSE-set2", "MAE-set2",
                                                                             "MSE-set3", "MAE-set3",
                                                                             "MSE-set4", "MAE-set4"])  
plotDF=plotDF.set_index("timestep")
In [72]:
plotDF.plot(use_index=True, figsize=(13,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("AdaBoost Regressor Error")
plt.ylabel("MSE / MAE")
plt.show()
In [72]:
scorePlot(window_slide_Lst, train_abr_score_Lst, test_abr_score_Lst, "AdaBoost Regressor")

Error Comparison:

Based on the MSE and MAE comparison plots below, two models clearly stand out: XGB and the Gradient Boosting Regressor outperform the other models. A window slide of six is consistently the best timestep for ALL models on this dataset, which is also reasonable for the domain of this project: the model uses the prior weeks' information to forecast the following week's sales. Using feature set 4 (i.e. "store_size", "type_A", "type_C"), test-set performance improves substantially over the baseline model: MSE drops from 0.108 to 0.0128 and MAE from 0.238 to 0.08, so the tuned model's error is roughly three to eight times smaller than the baseline's. This result shows that the primary goal of this project, finding the best model for predicting weekly sales across Walmart stores, has been achieved.
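The two error metrics quoted above are computed with sklearn's `mean_squared_error` / `mean_absolute_error`; as a quick reference, a minimal NumPy sketch of both (toy numbers, not the notebook's data):

```python
import numpy as np

def mse(y_true, y_pred):
    # mean squared error: squares residuals, so large misses dominate
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    # mean absolute error: linear penalty, more robust to outliers
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.5, 3.8]
print(mse(y_true, y_pred))  # ≈ 0.0775
print(mae(y_true, y_pred))  # ≈ 0.225
```

Because MSE squares residuals, a model with a few large misses is punished much harder on MSE than on MAE, which is why both are tracked in the plots.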

MAE Comparison Plot

In [98]:
df= pd.DataFrame(list(zip(window_slide_Lst,
                          test_xgb_mae_Lst4,
                          test_ridge_mae_Lst4,
                          test_lasso_mae_Lst4,
                          test_sgd_mae_Lst4,
                          test_bay_mae_Lst4,
                          test_rfr_mae_Lst4,
                          test_gbr_mae_Lst4,
                          test_abr_mae_Lst4)), columns=["timestep",
                                                       "xgb", "LR-Ridge", "LR-Lasso", 
                                                       "SGD Regressor", 
                                                       "Bayesian Regressor",
                                                       "Random Forest Regressor",
                                                       "Gradient Boosting Regressor",
                                                        "AdaBoost Regressor"
                                                      ])
df=df.set_index("timestep")


df.plot(use_index=True, figsize=(15,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("MAE Comparison")
plt.ylabel ("MAE")
plt.xlabel("Timestep")
plt.xticks(np.arange(2,11))
plt.show()

MSE Comparison Plot

In [101]:
df= pd.DataFrame(list(zip(window_slide_Lst,
                          test_xgb_mse_Lst4,
                          test_ridge_mse_Lst4,
                          test_lasso_mse_Lst4,
                          test_sgd_mse_Lst4,
                          test_bay_mse_Lst4,
                          test_rfr_mse_Lst4,
                          test_gbr_mse_Lst4,
                          test_abr_mse_Lst4)), columns=["timestep",
                                                       "xgb", "LR-Ridge", "LR-Lasso", 
                                                       "SGD Regressor", 
                                                       "Bayesian Regressor",
                                                       "Random Forest Regressor",
                                                       "Gradient Boosting Regressor",
                                                        "AdaBoost Regressor"
                                                      ])
df=df.set_index("timestep")


df.plot(use_index=True, figsize=(15,6))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("MSE Comparison")
plt.ylabel("MSE")
plt.xlabel("Timestep")
plt.xticks(np.arange(2,11))
plt.show()

Grid Search with Time step = 6

This part searches for the hyperparameters that give the models their highest accuracy. We grid-search the three best models identified above: XGB, Gradient Boosting Regressor, and Random Forest Regressor, fixing timestep = 6, which offered the best performance in the analysis so far.
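One caveat worth flagging: `cv=5` in GridSearchCV uses ordinary k-fold splits, so with time-ordered, sliding-window features some validation folds can precede part of their training data. sklearn's `TimeSeriesSplit` keeps every validation fold strictly after its training fold; a minimal sketch, where the toy `X`/`y` stand in for the windowed matrices:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy time-ordered data standing in for the windowed training matrix.
X = np.arange(20).reshape(-1, 1)
y = np.arange(20, dtype=float)

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # every validation index comes strictly after every training index
    print(train_idx.max(), "<", test_idx.min())
```

Passing `cv=tscv` instead of `cv=5` to GridSearchCV would apply the same chronological discipline to the hyperparameter search.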

In [40]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, make_scorer
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor 

xtrain,ytrain, train_date_Lst4, train_store_Lst4 = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                                  timestep=6, normalize=True)
xtest,ytest, test_date_Lst4, test_store_Lst4 = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                              timestep=6, normalize=True)

gbr= GradientBoostingRegressor(random_state = 0)


grid_params = {  
    'n_estimators': [100, 200,300,500,800,1000],
    "learning_rate": [0.001,0.01,0.1,1]

}

model = GridSearchCV(estimator=gbr,
                     param_grid=grid_params, 
                     cv = 5,
                     #scoring = make_scorer(mean_squared_error, greater_is_better=False)
                     scoring = "neg_mean_squared_error")
model.fit(xtrain, ytrain)
print("Best Parameters of Gradient Boosting Regressor:",model.best_params_)
print("Best Score:",model.best_score_)
Best Parameters of Gradient Boosting Regressor: {'learning_rate': 0.01, 'n_estimators': 100}
Best Score: -0.1950595015438056
In [67]:
gbr= GradientBoostingRegressor(random_state = 0,
                               n_estimators = 100,
                               learning_rate=0.01
)
gbr.fit(xtrain, ytrain)
gbr_mse=mean_squared_error(ytest, gbr.predict(xtest))
print("Train Set MSE",mean_squared_error(ytrain, gbr.predict(xtrain)))
print("Test Set MSE",gbr_mse)
Train Set MSE 0.026306843817567403
Test Set MSE 0.014600885220125773
In [51]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, make_scorer
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor 

xtrain,ytrain, train_date_Lst4, train_store_Lst4 = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                                  timestep=6, normalize=True)
xtest,ytest, test_date_Lst4, test_store_Lst4 = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                              timestep=6, normalize=True)

xgb= XGBRegressor(random_state = 0)


grid_params = {  
   'gamma':[i/10.0 for i in range(3,8)],  
    'subsample':[i/10.0 for i in range(5,10)],
}

model = GridSearchCV(estimator=xgb,
                     param_grid=grid_params, 
                     cv = 5,
                     #scoring = make_scorer(mean_squared_error, greater_is_better=False)
                     scoring = "neg_mean_squared_error")
model.fit(xtrain, ytrain)
print("Best Parameters of XGB Regressor:",model.best_params_)
print("Best Score:",model.best_score_)
Best Parameters of XGB Regressor: {'gamma': 0.7, 'subsample': 0.9}
Best Score: -0.24712110217553182
In [68]:
xgb= XGBRegressor(random_state = 0,
                  gamma= 0.7, 
                  subsample= 0.9                              
)
xgb.fit(xtrain, ytrain)
xgb_mse=mean_squared_error(ytest, xgb.predict(xtest))
print("Train Set MSE",mean_squared_error(ytrain, xgb.predict(xtrain)))
print("Test Set MSE",xgb_mse)
Train Set MSE 0.029118472394601417
Test Set MSE 0.01722268301469622
In [63]:
from sklearn.ensemble import RandomForestRegressor

xtrain,ytrain, train_date_Lst4, train_store_Lst4 = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                                  timestep=6, normalize=True)
xtest,ytest, test_date_Lst4, test_store_Lst4 = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]], 
                                                              timestep=6, normalize=True)

rfr= RandomForestRegressor(random_state = 0)


grid_params = {
    'max_depth': [10,  30,  50,  70,  90, 100, None],
    "max_features": ["auto", "sqrt"],
    'n_estimators': [200, 400, 600, 800, 1000]}

model = GridSearchCV(estimator=rfr,
                     param_grid=grid_params, 
                     cv = 5,
                     #scoring = make_scorer(mean_squared_error, greater_is_better=False)
                     scoring = "neg_mean_squared_error")
model.fit(xtrain, ytrain)
print("Best Parameters of RFR Regressor:",model.best_params_)
print("Best Score:",model.best_score_)
Best Parameters of RFR Regressor: {'max_depth': 10, 'n_estimators': 1000}
Best Score: -0.2910143101609217
In [108]:
rfr= RandomForestRegressor(random_state = 0,
                           max_depth=10,
                           n_estimators=1000
                          )
rfr.fit(xtrain, ytrain)
rfr_mse = mean_squared_error(ytest, rfr.predict(xtest))
print("Train Set MSE",mean_squared_error(ytrain, rfr.predict(xtrain)))
print("Test Set MSE",rfr_mse)
Train Set MSE 0.024341917395239318
Test Set MSE 0.012896143446045928

Result

The chart below illustrates that the three best models have very close test MSE values. The Random Forest model produces the lowest error, about 0.0129; Gradient Boosting takes second place; XGB is third. The hyperparameters that produced the optimized result for each model are listed below.

  • Gradient Boosting Regressor: 'learning_rate': 0.01, 'n_estimators': 100
  • XGB Regressor: 'gamma': 0.7, 'subsample': 0.9
  • Random Forest Regressor: 'max_depth': 10, 'n_estimators': 1000
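Rather than re-typing the winning hyperparameters into a fresh estimator (as the cells above do), GridSearchCV already refits the winner on the full training set when `refit=True` (the default) and exposes it as `best_estimator_`. A minimal sketch on toy data (the small grid here is illustrative, not the notebook's):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(60, 3)                      # toy stand-in for the windowed features
y = X.sum(axis=1) + 0.01 * rng.randn(60)

search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      {"learning_rate": [0.01, 0.1], "n_estimators": [50, 100]},
                      cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)

best = search.best_estimator_            # already refit on all of X, y
print(sorted(search.best_params_))       # ['learning_rate', 'n_estimators']
print(best.predict(X[:1]).shape)         # (1,)
```

Using `best_estimator_` avoids the risk of the re-typed parameters drifting out of sync with the printed `best_params_`.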
In [109]:
import matplotlib.pyplot as plt
%matplotlib inline
labels=["XGB", "Gradient Boosting", "Random Forest"]
fig=plt.figure()
ax1= fig.add_subplot(1,1,1)
ax1.bar(x=labels,
        height=[xgb_mse, gbr_mse, rfr_mse],
        color=["red", "yellow", "blue"],
        width = 0.5)
ax1.set_title("Regressor Error Comparison", fontsize=18)
ax1.set_ylabel("Mean_Square_Error", fontsize=15)
ax1.set_xlabel("Models", fontsize=15)
ax1.set_xticklabels(labels, ha="center")
plt.show()

Goal achieved

  • Discovered the correlation between features and sales
  • Discovered how individual features contribute differently to the predictive models
  • Determined the best model(s) for forecasting the weekly sales (continuous values)

Future Work

  • Fine-tune the hyperparameters further to optimize performance
  • Attempt deep learning methods (e.g. LSTM)
  • Engineer more representative features from the data
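If the LSTM route is pursued, the windowed inputs would need a (samples, timesteps, features) tensor rather than the flat matrices used above. A minimal NumPy sketch of the reshaping (`make_windows` is a hypothetical helper, not the notebook's `generate_x_y`):

```python
import numpy as np

def make_windows(series, timestep):
    # Stack each run of `timestep` consecutive values as one input window
    # and take the value that follows it as the target.
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + timestep] for i in range(len(series) - timestep)])
    y = series[timestep:]
    return X[..., np.newaxis], y          # add a trailing features axis

X, y = make_windows(np.arange(10.0), timestep=6)
print(X.shape, y.shape)                   # (4, 6, 1) (4,)
```

With timestep = 6 (the best window found above), each sample carries six weeks of history per feature, which is the shape a Keras or PyTorch LSTM layer consumes directly.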